Nadav Har'El, ScyllaDB
The Generalist Engineer meetup, Tel-Aviv
Ides of March, 2016
Seastar
Or how we implemented a
10-times faster Cassandra
2
● Israeli but multi-national startup company
– 15 developers cherry-picked from 10 countries.
● Founded 2013 (“Cloudius Systems”)
– by Avi Kivity and Dor Laor of KVM fame.
● Fans of open-source: OSv, Seastar, ScyllaDB.
3
Your mission, should
you choose to accept it:
Make Cassandra 10 times faster
4
“Make Cassandra 10 times faster”
● Why 10?
● Why Cassandra?
– Popular NoSQL database (2nd to MongoDB).
– Powerful and widely applicable.
– Example of a wider class of middleware.
● Why “mission impossible”?
– Cassandra is not considered particularly slow
– Considered faster than MongoDB, HBase, et al.
– “disk is the bottleneck” (no longer true, with SSDs!)

5
Our first attempt: OSv
● New OS design specifically for cloud VMs:
– Run a single application per VM (“unikernel”)
– Run existing Linux applications (Cassandra)
– Run these faster than Linux.
6
OSv
● Some of the many ideas we used in OSv:
– Single address space.
– System call is just a function call.
– Faster context switches.
– No spin locks.
– Smaller code.
– Redesigned network stack (Van Jacobson).
7
OSv
● Writing an entire OS from scratch was a really
fun exercise for our generalist engineers.
● Full description of OSv is beyond the scope of this talk. Check out:
– “OSv—Optimizing the Operating System for Virtual
Machines”, Usenix ATC 2014.
8
Cassandra on OSv
● Cassandra-stress, READ, 4 vcpu:
On OSv, 34% faster than Linux
● Very nice, but not even close to our goal.
What are the remaining bottlenecks?

9
Bottlenecks: API locks
● In one profile, we saw 20% of the runtime spent
in lock() and unlock() operations, most of them uncontended
– POSIX APIs allow threads to share
● file descriptors
● sockets
– As many as 20 lock/unlock pairs for each network packet!
● Uncontended locks were cheap on a uniprocessor (a flag to
disable preemption),
but the atomic operations they need are slow on many cores.
10
Bottlenecks: API copies
● The write/send system calls copy user data into
the kernel
– Even on OSv, with no user-kernel separation
– It is part of the socket API
● Similarly for read
11
Bottlenecks: context switching
● One thread per CPU is optimal; more than one requires:
– Context-switch time
– Stacks that consume memory and pollute the CPU cache
– Thread imbalance
● But one thread per CPU requires fully non-blocking APIs
– Cassandra uses mmap() for disk I/O…
12
Bottlenecks:
unscalable applications
● Contended locks ruin scalability on many cores
– Memcache's counter and shared cache
● Solution: per-CPU data.
● Even lock-free atomic algorithms are unscalable
– Cache-line bouncing
● Again, better to shard, not share, data (see the sketch below).
– Becomes worse as core count grows
● NUMA
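To make “shard, don't share” concrete, here is a minimal sketch (not from the slides) contrasting a shared atomic counter with a per-core one; the names are invented:

    #include <atomic>
    #include <cstdint>

    // Shared counter: every increment bounces the cache line holding the
    // atomic between all the cores that touch it.
    std::atomic<uint64_t> shared_hits{0};
    void count_shared() { shared_hits.fetch_add(1, std::memory_order_relaxed); }

    // Sharded counter: each thread (pinned to its own core) bumps its own
    // copy -- no atomics, no cache-line bouncing. A reader that needs the
    // total sums the per-thread copies, rarely.
    thread_local uint64_t local_hits = 0;
    void count_sharded() { ++local_hits; }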

13
Therefore
● Need to provide better APIs for server
applications
– Not file descriptors, sockets, threads, etc.
● Need to write better applications.
14
Framework
● One thread per CPU
– Event-driven programming
– Everything (network & disk) is non-blocking
– How to write complex applications?
15
Framework
● Sharded (shared-nothing) applications
– Important!
16
Framework
● Language with no runtime overheads or built-in
data sharing

17
Seastar
● C++14 library
● For writing new high-performance server applications
● Share-nothing model, fully asynchronous
● Futures & Continuations based
– Unified API for all asynchronous operations
– Compose complex asynchronous operations
– The key to complex applications
● (Optionally) full zero-copy user-space TCP/IP (over DPDK)
● Open source: http://www.seastar-project.org/
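Not on the slides, but to make the model concrete: the canonical minimal Seastar program, sketched after the tutorial linked on the “More info” slide (include paths and the seastar:: namespace follow the current sources; the 2016 tree spelled these slightly differently):

    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        // run() starts one reactor (event loop) per CPU and invokes the
        // lambda on CPU 0; the program exits when the returned future
        // resolves.
        return app.run(argc, argv, [] {
            std::cout << "Hello from Seastar\n";
            return seastar::make_ready_future<>();
        });
    }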
18
Seastar linear scaling in #cores
19
Seastar linear scaling in #cores
20
Brief introduction to Seastar

21
Sharded application design
● One thread per CPU
● Each thread handles one shard of data
– No shared data (“share nothing”)
– Separate memory per CPU (NUMA aware)
– Message-passing between CPUs
– No locks or cache line bounces
● Reactor (event loop) per thread
● User-space network stack also sharded
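As a small, hedged illustration of the message-passing style (the key-to-shard mapping and the counter are invented for the example; smp::submit_to is the real Seastar call):

    #include <seastar/core/smp.hh>
    #include <seastar/core/future.hh>

    // Each shard (CPU) owns its own counter; no locks, no atomics.
    static thread_local uint64_t requests_handled = 0;

    // Instead of sharing data, send the work to the shard that owns it.
    seastar::future<> record_request(unsigned key_hash) {
        unsigned owner = key_hash % seastar::smp::count;
        return seastar::smp::submit_to(owner, [] {
            ++requests_handled;   // only ever touched by the owning shard
        });
    }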
22
Futures and continuations
● Futures and continuations are the building
blocks of asynchronous programming in
Seastar.
● Can be composed together to a large, complex,
asynchronous program.
23
Futures and continuations
● A future is a result which may not be available yet:
– Data buffer from the network
– Timer expiration
– Completion of a disk write
– The result of a computation which requires the values
from one or more other futures.
● future<int>
● future<>
24
Futures and continuations
● An asynchronous function (also “promise”) is
a function returning a future:
– future<> sleep(duration)
– future<temporary_buffer<char>> read()
● The function sets up for the future to be fulfilled
– sleep() sets a timer to fulfill the future it returns
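The Seastar tutorial (linked on the “More info” slide) illustrates this with a promise; a deliberately simplified sketch along those lines (a static promise only supports a single call, and the one-second timer is just a stand-in for a real event):

    #include <seastar/core/future.hh>
    #include <seastar/core/sleep.hh>
    #include <chrono>

    seastar::future<int> slow() {
        static seastar::promise<int> p;
        // Return the future now; fulfill it later, when the timer fires.
        // (The sleep's own future is deliberately dropped here.)
        (void)seastar::sleep(std::chrono::seconds(1)).then([] {
            p.set_value(3);
        });
        return p.get_future();
    }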

25
Futures and continuations
● A continuation is a callback, typically a lambda
executed when a future becomes ready
– sleep(1s).then([] {
    std::cerr << "done";
});
● A continuation can hold state (lambda capture)
– future<int> slow_incr(int i) {
    return sleep(10ms).then(
        [i] { return i+1; });
}
26
Futures and continuations
● Continuations can be nested:
– future<int> get();
future<> put(int);
get().then([] (int value) {
    put(value+1).then([] {
        std::cout << "done";
    });
});
● Or chained:
– get().then([] (int value) {
    return put(value+1);
}).then([] {
    std::cout << "done";
});
27
Futures and continuations
● Parallelism is easy:
– sleep(100ms).then([] {
    std::cout << "100ms\n";
});
sleep(200ms).then([] {
    std::cout << "200ms\n";
});
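When the parallel results need to be joined, Seastar provides combinators; a hedged sketch assuming when_all_succeed and the current header layout (older trees had it in core/future-util.hh):

    #include <seastar/core/future.hh>
    #include <seastar/core/sleep.hh>
    #include <seastar/core/when_all.hh>
    #include <chrono>
    #include <iostream>

    // Both sleeps start immediately and run in parallel; the continuation
    // runs only after both futures have resolved.
    seastar::future<> sleep_both() {
        using namespace std::chrono_literals;
        return seastar::when_all_succeed(seastar::sleep(100ms),
                                         seastar::sleep(200ms))
            .then([] { std::cout << "both done\n"; });
    }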
28
Futures and continuations
● In Seastar, every asynchronous operation is a
future:
– Network read or write
– Disk read or write
– Timers
– …
– A complex combination of other futures
● Useful for everything from writing network stack to
writing a full, complex, application.

29
Network zero-copy
● future<temporary_buffer>
input_stream::read()
– temporary_buffer points at driver-provided pages, if
possible.
– Automatically discarded after use (C++).
● future<> output_stream::write(temporary_buffer)
– Future becomes ready when TCP window allows further
writes (usually immediately).
– Buffer discarded after data is ACKed.
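A hedged sketch of how these streams compose in practice; only a per-connection echo loop is shown, the surrounding server setup is omitted, and the caller is assumed to keep the streams alive:

    #include <seastar/core/future.hh>
    #include <seastar/core/iostream.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <seastar/core/loop.hh>

    // Echo every buffer back to the peer until the connection is closed
    // (read() resolves to an empty buffer on EOF).
    seastar::future<> echo(seastar::input_stream<char>& in,
                           seastar::output_stream<char>& out) {
        return seastar::repeat([&in, &out] {
            return in.read().then([&out] (seastar::temporary_buffer<char> buf) {
                if (buf.empty()) {
                    return seastar::make_ready_future<seastar::stop_iteration>(
                            seastar::stop_iteration::yes);
                }
                return out.write(std::move(buf))
                    .then([&out] { return out.flush(); })
                    .then([] { return seastar::stop_iteration::no; });
            });
        });
    }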
30
Two TCP/IP implementations
Networking API
Seastar (native) Stack POSIX (hosted) stack
Linux kernel (sockets)
User-space TCP/IP
Interface layer
DPDK
Virtio Xen
igb ixgb
31
Disk I/O
● Asynchronous and zero copy, using AIO and
O_DIRECT.
● Not implemented well by all filesystems
– XFS recommended
● Focusing on SSD
● Future thought:
– Direct NVMe support,
– Implement filesystem in Seastar.
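A hedged sketch of an asynchronous read through Seastar's file API (the file name and the 4 KiB size are invented; DMA reads need aligned offsets and sizes):

    #include <seastar/core/seastar.hh>
    #include <seastar/core/file.hh>
    #include <seastar/core/future.hh>

    // Read the first 4 KiB of a file. dma_read() bypasses the page cache
    // (O_DIRECT underneath) and completes via AIO, so the reactor thread
    // never blocks.
    seastar::future<seastar::temporary_buffer<char>> read_head() {
        return seastar::open_file_dma("data.bin", seastar::open_flags::ro)
            .then([] (seastar::file f) {
                return f.dma_read<char>(0, 4096).finally([f] () mutable {
                    return f.close();
                });
            });
    }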
32
More info on Seastar
● http://seastar-project.org/
● https://github.com/scylladb/seastar
● http://docs.seastar-project.org/
● http://docs.seastar-project.org/master/md_doc_tutorial.html

33
ScyllaDB
● NoSQL database, implemented in Seastar.
● Fully compatible with Cassandra:
– Same CQL queries
– Copy over a complete Cassandra database
– Use existing drivers
– Use existing cassandra.yaml
– Use same nodetool or JMX console
– Can be clustered (of course...)
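To illustrate the “use existing drivers” point, a hedged sketch using the stock DataStax C/C++ driver pointed at a Scylla node (host address and query are just examples; error handling trimmed):

    #include <cassandra.h>
    #include <cstdio>

    int main() {
        // The unmodified Cassandra driver speaks CQL to Scylla as-is.
        CassCluster* cluster = cass_cluster_new();
        CassSession* session = cass_session_new();
        cass_cluster_set_contact_points(cluster, "127.0.0.1");  // a Scylla node

        CassFuture* connect = cass_session_connect(session, cluster);
        if (cass_future_error_code(connect) == CASS_OK) {
            CassStatement* stmt =
                cass_statement_new("SELECT release_version FROM system.local", 0);
            CassFuture* result = cass_session_execute(session, stmt);
            std::printf("query rc=%d\n", (int)cass_future_error_code(result));
            cass_future_free(result);
            cass_statement_free(stmt);
        }
        cass_future_free(connect);
        cass_session_free(session);
        cass_cluster_free(cluster);
    }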
34
[Diagram: Cassandra layers a key cache, a row cache (on-heap / off-heap), and the Linux page cache above its SSTables; ScyllaDB has a single unified cache above its SSTables.]
● Don't double-cache.
● Don't cache unrelated rows.
● Don't cache unparsed sstables.
● Can fit much more into cache.
● No page faults, threads, etc.
35
Scylla vs. Cassandra
● Single node benchmark:
– 2 x 12-core x 2 hyperthread Intel(R) Xeon(R) CPU
E5-2690 v3 @ 2.60GHz
cassandra-stress results (operations per second):

Benchmark   ScyllaDB    Cassandra
Write      1,871,556      251,785
Read       1,585,416       95,874
Mixed      1,372,451      108,947
36
Scylla vs. Cassandra
● We really got a 7x–16x speedup!
● Reads sped up more:
– Cassandra writes are simpler
– Row-cache benefits further improve Scylla's reads
● Almost 2 million writes per second on a single
machine!
– Google reported in their blogs achieving 1 million writes
per second on 330 (!) machines
– (2 years ago, and with RF=3… but still impressive).

37
Scylla vs. Cassandra
3 node cluster, 2x12 cores each; RF=3, CL=quorum
38
Better latency, at all load levels
39
What will you do with 10x performance?
● Shrink your cluster by a factor of 10
● Use stronger (but slower) data models
● Run more queries - more value from your data
● Stop using caches in front of databases
40

41
Do we qualify?
In 3 years, our small team wrote:
● A complete kernel and library (OSv).
● An asynchronous programming framework
(Seastar).
● A complete Cassandra-compatible NoSQL
database (ScyllaDB).
42
43
This project has received funding from the European Union’s
Horizon 2020 research and innovation programme under grant
agreement No 645402.

More Related Content

What's hot

Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
ScyllaDB
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
DataStax Academy
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
ScyllaDB
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
Sage Weil
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
ScyllaDB
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
DataWorks Summit
 
Ceph
CephCeph
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
ScyllaDB
 
From Postgres to ScyllaDB: Migration Strategies and Performance Gains
From Postgres to ScyllaDB: Migration Strategies and Performance GainsFrom Postgres to ScyllaDB: Migration Strategies and Performance Gains
From Postgres to ScyllaDB: Migration Strategies and Performance Gains
ScyllaDB
 
Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2
Databricks
 
Spotify: Automating Cassandra repairs
Spotify: Automating Cassandra repairsSpotify: Automating Cassandra repairs
Spotify: Automating Cassandra repairs
DataStax Academy
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
DataWorks Summit
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Cloudera, Inc.
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
DataWorks Summit
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
Ceph Community
 
How Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferHow Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and Safer
ScyllaDB
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
DataWorks Summit
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
Jose De La Rosa
 
Anil nair rac_internals_sangam_2016
Anil nair rac_internals_sangam_2016Anil nair rac_internals_sangam_2016
Anil nair rac_internals_sangam_2016
Anil Nair
 

What's hot (20)

Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Ceph
CephCeph
Ceph
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
 
From Postgres to ScyllaDB: Migration Strategies and Performance Gains
From Postgres to ScyllaDB: Migration Strategies and Performance GainsFrom Postgres to ScyllaDB: Migration Strategies and Performance Gains
From Postgres to ScyllaDB: Migration Strategies and Performance Gains
 
Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2
 
Spotify: Automating Cassandra repairs
Spotify: Automating Cassandra repairsSpotify: Automating Cassandra repairs
Spotify: Automating Cassandra repairs
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
 
How Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferHow Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and Safer
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
Anil nair rac_internals_sangam_2016
Anil nair rac_internals_sangam_2016Anil nair rac_internals_sangam_2016
Anil nair rac_internals_sangam_2016
 

Similar to Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra

Bulk Loading into Cassandra
Bulk Loading into CassandraBulk Loading into Cassandra
Bulk Loading into Cassandra
Brian Hess
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Spark Summit
 
Adventures in Thread-per-Core Async with Redpanda and Seastar
Adventures in Thread-per-Core Async with Redpanda and SeastarAdventures in Thread-per-Core Async with Redpanda and Seastar
Adventures in Thread-per-Core Async with Redpanda and Seastar
ScyllaDB
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
Stavros Kontopoulos
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
Pollfish
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit Boston
Sage Weil
 
What's new in Jewel and Beyond
What's new in Jewel and BeyondWhat's new in Jewel and Beyond
What's new in Jewel and Beyond
Sage Weil
 
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to SuccessCassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
ScyllaDB
 
Crimson: Ceph for the Age of NVMe and Persistent Memory
Crimson: Ceph for the Age of NVMe and Persistent MemoryCrimson: Ceph for the Age of NVMe and Persistent Memory
Crimson: Ceph for the Age of NVMe and Persistent Memory
ScyllaDB
 
Linux Huge Pages
Linux Huge PagesLinux Huge Pages
Linux Huge Pages
Geraldo Netto
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Sage Weil
 
Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
kanedafromparis
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
rkr10
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
DataStax Academy
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Julien Anguenot
 
Docker and coreos20141020b
Docker and coreos20141020bDocker and coreos20141020b
Docker and coreos20141020b
Richard Kuo
 
Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017
Dave Holland
 
OSv at Usenix ATC 2014
OSv at Usenix ATC 2014OSv at Usenix ATC 2014
OSv at Usenix ATC 2014
Don Marti
 

Similar to Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra (20)

Bulk Loading into Cassandra
Bulk Loading into CassandraBulk Loading into Cassandra
Bulk Loading into Cassandra
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
 
Adventures in Thread-per-Core Async with Redpanda and Seastar
Adventures in Thread-per-Core Async with Redpanda and SeastarAdventures in Thread-per-Core Async with Redpanda and Seastar
Adventures in Thread-per-Core Async with Redpanda and Seastar
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit Boston
 
What's new in Jewel and Beyond
What's new in Jewel and BeyondWhat's new in Jewel and Beyond
What's new in Jewel and Beyond
 
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to SuccessCassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
 
Crimson: Ceph for the Age of NVMe and Persistent Memory
Crimson: Ceph for the Age of NVMe and Persistent MemoryCrimson: Ceph for the Age of NVMe and Persistent Memory
Crimson: Ceph for the Age of NVMe and Persistent Memory
 
Linux Huge Pages
Linux Huge PagesLinux Huge Pages
Linux Huge Pages
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
 
Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
Docker and coreos20141020b
Docker and coreos20141020bDocker and coreos20141020b
Docker and coreos20141020b
 
Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017
 
OSv at Usenix ATC 2014
OSv at Usenix ATC 2014OSv at Usenix ATC 2014
OSv at Usenix ATC 2014
 

Recently uploaded

Net Zero Case Study: SRK House and SRK Empire
Net Zero Case Study: SRK House and SRK EmpireNet Zero Case Study: SRK House and SRK Empire
Net Zero Case Study: SRK House and SRK Empire
Global Network for Zero
 
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE DonatoCONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
Servizi a rete
 
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeBangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
bookhotbebes1
 
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
PradeepKumarSK3
 
Chlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptxChlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptx
yadavsuyash008
 
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.docCCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
Dss
 
Vernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsxVernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsx
Tool and Die Tech
 
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
YanKing2
 
Biology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtuBiology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtu
santoshpatilrao33
 
Lecture 6 - The effect of Corona effect in Power systems.pdf
Lecture 6 - The effect of Corona effect in Power systems.pdfLecture 6 - The effect of Corona effect in Power systems.pdf
Lecture 6 - The effect of Corona effect in Power systems.pdf
peacekipu
 
LeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdfLeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdf
pavanaroshni1977
 
Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.
Tool and Die Tech
 
Software Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project ManagementSoftware Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project Management
Prakhyath Rai
 
Conservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic RegenerationConservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic Regeneration
PriyankaKarn3
 
Quadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and ControlQuadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and Control
Blesson Easo Varghese
 
Exploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative ReviewExploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative Review
sipij
 
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K SchemeMSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
Anwar Patel
 
How to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POSHow to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POS
Celine George
 
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
Jim Mimlitz, P.E.
 
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeRohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
binna singh$A17
 

Recently uploaded (20)

Net Zero Case Study: SRK House and SRK Empire
Net Zero Case Study: SRK House and SRK EmpireNet Zero Case Study: SRK House and SRK Empire
Net Zero Case Study: SRK House and SRK Empire
 
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE DonatoCONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
 
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeBangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
 
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
 
Chlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptxChlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptx
 
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.docCCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
 
Vernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsxVernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsx
 
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
 
Biology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtuBiology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtu
 
Lecture 6 - The effect of Corona effect in Power systems.pdf
Lecture 6 - The effect of Corona effect in Power systems.pdfLecture 6 - The effect of Corona effect in Power systems.pdf
Lecture 6 - The effect of Corona effect in Power systems.pdf
 
LeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdfLeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdf
 
Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.
 
Software Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project ManagementSoftware Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project Management
 
Conservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic RegenerationConservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic Regeneration
 
Quadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and ControlQuadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and Control
 
Exploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative ReviewExploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative Review
 
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K SchemeMSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
 
How to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POSHow to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POS
 
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
 
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeRohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 

Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra

  • 9. 9 Bottlenecks: API locks ● In one profile, we saw 20% of the run time spent in lock() and unlock() operations, most of them uncontended – Posix APIs allow threads to share ● file descriptors ● sockets – As many as 20 lock/unlock operations for each network packet! ● Uncontended locks were efficient on UP (a flag to disable preemption), but atomic operations are slow on many cores.
  • 10. 10 Bottlenecks: API copies ● The write/send system calls copy user data into the kernel – Even on OSv, with no user/kernel separation – It is part of the socket API ● Similarly for read
  • 11. 11 Bottlenecks: context switching ● One thread per CPU is optimal; more than one requires: – Context switch time – Stacks that consume memory and pollute the CPU cache – Thread imbalance ● Requires fully non-blocking APIs – Cassandra uses mmap() for disk I/O….
  • 12. 12 Bottlenecks: unscalable applications ● Contended locks ruin scalability to many cores – Memcache's counter and shared cache ● Solution: per-cpu data. ● Even lock-free atomic algorithms are unscalable – Cache line bouncing ● Again, better to shard, not share, data. – Becomes worse as core count grows ● NUMA
  • 13. 13 Therefore ● Need to provide better APIs for server applications – Not file descriptors, sockets, threads, etc. ● Need to write better applications.
  • 14. 14 Framework ● One thread per CPU – Event-driven programming – Everything (network & disk) is non-blocking – How to write complex applications?
  • 15. 15 Framework ● Sharded (shared-nothing) applications – Important!
  • 16. 16 Framework ● Language with no runtime overheads or built-in data sharing
  • 17. 17 Seastar ● C++14 library ● For writing new high-performance server applications ● Share-nothing model, fully asynchronous ● Based on futures & continuations – A unified API for all asynchronous operations – Compose complex asynchronous operations – The key to complex applications ● (Optionally) full zero-copy user-space TCP/IP (over DPDK) ● Open source: http://www.seastar-project.org/
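To connect the bullet points above to code, here is a minimal sketch of what a Seastar program looks like. It only uses the public app_template API; the header paths, the seastar:: namespace, and make_ready_future() are taken from current Seastar releases rather than from the slides, so treat the exact spellings as assumptions (the 2016-era headers and namespaces differed).

    // Minimal Seastar "hello world" (sketch; header paths and namespace
    // are assumptions based on current Seastar releases).
    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        // app.run() parses Seastar's command-line options, starts one
        // reactor (event loop) per core, and runs the supplied lambda on
        // core 0; the program exits when the returned future resolves.
        return app.run(argc, argv, [] {
            std::cout << "Hello from the Seastar reactor\n";
            return seastar::make_ready_future<>();
        });
    }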
  • 21. 21 Sharded application design ● One thread per CPU ● Each thread handles one shard of data – No shared data (“share nothing”) – Separate memory per CPU (NUMA aware) – Message-passing between CPUs – No locks or cache line bounces ● Reactor (event loop) per thread ● User-space network stack also sharded
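To make the "message-passing between CPUs" point concrete, here is a small sketch of how one shard updates state owned by another shard without locks. It is built around seastar::smp::submit_to(); this_shard_id(), the header paths and the thread_local-per-shard pattern are assumptions drawn from today's Seastar API rather than from the slides.

    // Sketch of cross-shard messaging (names/headers assumed from current
    // Seastar; the structure is the point, not the exact spelling).
    #include <seastar/core/smp.hh>
    #include <seastar/core/future.hh>
    #include <cstdint>

    // Each shard owns its private counter; no shared memory, no locks.
    thread_local uint64_t local_hits = 0;

    // To touch state owned by another shard, send it a message (a lambda)
    // instead of writing to its memory.
    seastar::future<> record_hit(unsigned owner_shard) {
        if (owner_shard == seastar::this_shard_id()) {
            ++local_hits;                     // fast path: we own the data
            return seastar::make_ready_future<>();
        }
        return seastar::smp::submit_to(owner_shard, [] {
            ++local_hits;                     // runs on the owning shard
        });
    }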
  • 22. 22 Futures and continuations ● Futures and continuations are the building blocks of asynchronous programming in Seastar. ● Can be composed together to a large, complex, asynchronous program.
  • 23. 23 Futures and continuations ● A future is a result which may not be available yet: – Data buffer from the network – Timer expiration – Completion of a disk write – The result of a computation which requires the values from one or more other futures. ● future<int> ● future<>
  • 24. 24 Futures and continuations ● An asynchronous function (also “promise”) is a function returning a future: – future<> sleep(duration) – future<temporary_buffer<char>> read() ● The function sets up for the future to be fulfilled – sleep() sets a timer to fulfill the future it returns
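As an illustration of the promise/future pairing described above, the hypothetical slow_answer() below shows the producing side: the promise is held by the code that will eventually produce the value, and set_value() is what makes the returned future ready. The names promise, get_future() and set_value() match Seastar's public API as far as I know; the helper itself, and fulfilling the promise immediately instead of from a real event handler, are purely for illustration.

    // Sketch of an asynchronous function built from a promise (illustrative
    // only; real code would fulfill the promise from an event handler).
    #include <seastar/core/future.hh>
    #include <memory>

    seastar::future<int> slow_answer() {
        // The promise is the producing side of the future.
        auto p = std::make_unique<seastar::promise<int>>();
        auto fut = p->get_future();
        // Arrange for the promise to be fulfilled later; here we pretend
        // some asynchronous event invokes this completion callback.
        auto complete = [p = std::move(p)]() mutable {
            p->set_value(42);  // makes the future ready, runs continuations
        };
        complete();            // immediate here only for illustration
        return fut;
    }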
  • 25. 25 Futures and continuations ● A continuation is a callback, typically a lambda, executed when a future becomes ready – sleep(1s).then([] { std::cerr << "done"; }); ● A continuation can hold state (lambda capture) – future<int> slow_incr(int i) { return sleep(10ms).then([i] { return i + 1; }); }
  • 26. 26 Futures and continuations ● Continuations can be nested: – future<int> get(); future<> put(int); get().then([] (int value) { put(value+1).then([] { std::cout << "done"; }); }); ● Or chained: – get().then([] (int value) { return put(value+1); }).then([] { std::cout << "done"; });
  • 27. 27 Futures and continuations ● Parallelism is easy: – sleep(100ms).then([] { std::cout << "100ms\n"; }); sleep(200ms).then([] { std::cout << "200ms\n"; });
  • 28. 28 Futures and continuations ● In Seastar, every asynchronous operation is a future: – Network read or write – Disk read or write – Timers – … – A complex combination of other futures ● Useful for everything from writing network stack to writing a full, complex, application.
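A small sketch of such a combination: two independent timers composed into one future. when_all() and discard_result() are assumed from current Seastar releases (older versions kept these utilities in a different header), so the exact spelling may differ from the 2016 API; the structure is the point — the combined operation is itself just another future.

    // Composing independent futures into one (sketch; helper names and
    // headers assumed from current Seastar releases).
    #include <seastar/core/sleep.hh>
    #include <seastar/core/when_all.hh>
    #include <chrono>

    using namespace std::chrono_literals;

    seastar::future<> wait_for_both() {
        // Both timers run concurrently on the same reactor; the combined
        // future resolves only when every component future has resolved.
        return seastar::when_all(
            seastar::sleep(100ms),
            seastar::sleep(200ms)
        ).discard_result().then([] {
            // Reached after ~200ms, not 300ms: the waits overlapped.
        });
    }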
  • 29. 29 Network zero-copy ● future<temporary_buffer> input_stream::read() – temporary_buffer points at driver-provided pages, if possible. – Automatically discarded after use (C++). ● future<> output_stream::write(temporary_buffer) – Future becomes ready when the TCP window allows further writes (usually immediately). – Buffer discarded after the data is ACKed.
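Putting the two calls above together, a hedged sketch of a zero-copy forwarding loop (the core of an echo server, say) might look like the following. input_stream::read() and output_stream::write() are the calls named on the slide; repeat(), stop_iteration, flush(), close() and the header paths are assumptions based on current Seastar and may have differed in 2016. Error handling and stream lifetime are left to the caller.

    // Zero-copy echo loop sketch: buffers from read() are handed to
    // write() without copying the payload.
    #include <seastar/core/iostream.hh>
    #include <seastar/core/loop.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <utility>

    seastar::future<> echo(seastar::input_stream<char>& in,
                           seastar::output_stream<char>& out) {
        return seastar::repeat([&in, &out] {
            return in.read().then([&out] (seastar::temporary_buffer<char> buf) {
                if (buf.empty()) {       // an empty buffer signals end of stream
                    return seastar::make_ready_future<seastar::stop_iteration>(
                        seastar::stop_iteration::yes);
                }
                // The buffer still points at the pages the driver produced;
                // write() passes it onward without copying the payload.
                return out.write(std::move(buf)).then([&out] {
                    return out.flush();
                }).then([] {
                    return seastar::stop_iteration::no;
                });
            });
        }).then([&out] {
            return out.close();
        });
    }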
  • 30. 30 Two TCP/IP implementations, behind one networking API: – Seastar (native) stack: user-space TCP/IP on top of an interface layer (DPDK, virtio, Xen, igb, ixgb). – POSIX (hosted) stack: sockets provided by the Linux kernel.
  • 31. 31 Disk I/O ● Asynchronous and zero copy, using AIO and O_DIRECT. ● Not implemented well by all filesystems – XFS recommended ● Focusing on SSD ● Future thought: – Direct NVMe support, – Implement filesystem in Seastar.
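For completeness, a heavily hedged sketch of what asynchronous, O_DIRECT-style file access looks like through Seastar. open_file_dma(), open_flags::ro and file::dma_read() are taken from current Seastar documentation, not from the slide, so the exact signatures and header locations are assumptions; the shape, however, matches the slide: every disk operation returns a future, and the data arrives in an aligned buffer that bypassed the page cache.

    // Asynchronous, direct-I/O read sketch (API names assumed from current
    // Seastar; the 2016-era signatures may have differed).
    #include <seastar/core/seastar.hh>
    #include <seastar/core/file.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <seastar/core/sstring.hh>

    seastar::future<> read_first_block(seastar::sstring path) {
        return seastar::open_file_dma(path, seastar::open_flags::ro)
            .then([] (seastar::file f) {
                // DMA reads must be aligned to the device block size;
                // 4096 bytes is commonly a safe granule for SSDs.
                return f.dma_read<char>(0, 4096).then(
                    [f] (seastar::temporary_buffer<char> buf) mutable {
                        // buf holds the first 4 KiB, read with O_DIRECT
                        // semantics (no page cache involved).
                        return f.close();
                    });
            });
    }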
  • 32. 32 More info on Seastar ● http://seastar-project.com ● https://github.com/scylladb/seastar ● http://docs.seastar-project.org/ ● http://docs.seastar-project.org/master/md_doc_tutorial.html
  • 33. 33 ScyllaDB ● NoSQL database, implemented in Seastar. ● Fully compatible with Cassandra: – Same CQL queries – Copy over a complete Cassandra database – Use existing drivers – Use existing cassandra.yaml – Use same nodetool or JMX console – Can be clustered (of course...)
  • 34. 34 Cassandra vs. ScyllaDB caching – Cassandra: key cache + row cache (on-heap / off-heap) + Linux page cache, all in front of the SSTables. – ScyllaDB: a single unified cache in front of the SSTables. ● Don't double-cache. ● Don't cache unrelated rows. ● Don't cache unparsed sstables. ● Can fit much more into cache. ● No page faults, threads, etc.
  • 35. 35 Scylla vs. Cassandra ● Single node benchmark: – 2 x 12-core x 2-hyperthread Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz ● cassandra-stress results (operations per second): – Write: ScyllaDB 1,871,556 vs. Cassandra 251,785 – Read: ScyllaDB 1,585,416 vs. Cassandra 95,874 – Mixed: ScyllaDB 1,372,451 vs. Cassandra 108,947
  • 36. 36 Scylla vs. Cassandra ● We really got a 7x – 16x speedup! ● Reads sped up more – – Cassandra writes are simpler – Row-cache benefits further improve Scylla's reads ● Almost 2 million writes per second on a single machine! – Google reported in their blogs achieving 1 million writes per second on 330 (!) machines – (2 years ago, and RF=3… but still impressive).
  • 37. 37 Scylla vs. Cassandra 3 node cluster, 2x12 cores each; RF=3, CL=quorum
  • 38. 38 Better latency, at all load levels
  • 39. 39 What will you do with 10x performance? ● Shrink your cluster by a factor of 10 ● Use stronger (but slower) data models ● Run more queries - more value from your data ● Stop using caches in front of databases
  • 41. 41 Do we qualify? In 3 years, our small team wrote: ● A complete kernel and library (OSv). ● An asynchronous programming framework (Seastar). ● A complete Cassandra-compatible NoSQL database (ScyllaDB).
  • 43. 43 This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 645402.