Nadav Har'El, ScyllaDB
The Generalist Engineer meetup, Tel-Aviv
Ides of March, 2016
Seastar
Or how we implemented a
10-times faster Cassandra
2
● Israeli but multi-national startup company
– 15 developers cherry-picked from 10 countries.
● Founded 2013 (“Cloudius Systems”)
– by Avi Kivity and Dor Laor of KVM fame.
● Fans of open-source: OSv, Seastar, ScyllaDB.
3
Your mission, should
you choose to accept it:
Make Cassandra 10 times faster
4
“Make Cassandra 10 times faster”
● Why 10?
● Why Cassandra?
– Popular NoSQL database (2nd to MongoDB).
– Powerful and widely applicable.
– Example of a wider class of middleware.
● Why “mission impossible”?
– Cassandra is not considered particularly slow
– Considered faster than MongoDB, HBase, et al.
– “disk is the bottleneck” (no longer true, with SSDs!)

5
Our first attempt: OSv
● New OS design specifically for cloud VMs:
– Run a single application per VM (“unikernel”)
– Run existing Linux applications (Cassandra)
– Run these faster than Linux.
6
OSv
● Some of the many ideas we used in OSv:
– Single address space.
– System call is just a function call.
– Faster context switches.
– No spin locks.
– Smaller code.
– Redesigned network stack (Van Jacobson).
7
OSv
● Writing an entire OS from scratch was a really
fun exercise for our generalist engineers.
● Full description of OSv is beyond the scope of this talk. Check out:
– “OSv—Optimizing the Operating System for Virtual
Machines”, Usenix ATC 2014.
8
Cassandra on OSv
● Cassandra-stress, READ, 4 vcpu:
On OSv, 34% faster than Linux
● Very nice, but not even close to our goal.
What are the remaining bottlenecks?

9
Bottlenecks: API locks
● In one profile, we saw 20% of the runtime spent
in lock() and unlock() operations, most of them uncontended
– POSIX APIs allow threads to share
● file descriptors
● sockets
– As many as 20 lock/unlock pairs for each network packet!
● Uncontended locks were cheap on a uniprocessor (a flag to
disable preemption),
but the atomic operations they need are slow on many cores.
10
Bottlenecks: API copies
● The write/send system calls copy user data into
the kernel
– Even on OSv, with no user-kernel separation
– It is part of the socket API
● Similarly for read
11
Bottlenecks: context switching
● One thread per CPU is optimal; more than one requires:
– Context-switch time
– Stacks that consume memory and pollute the CPU cache
– Thread imbalance
● But one thread per CPU requires fully non-blocking APIs
– Cassandra uses mmap() for disk I/O…
12
Bottlenecks:
unscalable applications
● Contended locks ruin scalability on many cores
– Memcache's counter and shared cache
● Solution: per-CPU data.
● Even lock-free atomic algorithms are unscalable
– Cache-line bouncing
● Again, better to shard, not share, data (see the sketch below).
– Becomes worse as core count grows
● NUMA
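To make “shard, don't share” concrete, here is a minimal sketch (not from the slides) contrasting a shared atomic counter with a per-core one; the names are invented:

    #include <atomic>
    #include <cstdint>

    // Shared counter: every increment bounces the cache line holding the
    // atomic between all the cores that touch it.
    std::atomic<uint64_t> shared_hits{0};
    void count_shared() { shared_hits.fetch_add(1, std::memory_order_relaxed); }

    // Sharded counter: each thread (pinned to its own core) bumps its own
    // copy -- no atomics, no cache-line bouncing. A reader that needs the
    // total sums the per-thread copies, rarely.
    thread_local uint64_t local_hits = 0;
    void count_sharded() { ++local_hits; }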

13
Therefore
● Need to provide better APIs for server
applications
– Not file descriptors, sockets, threads, etc.
● Need to write better applications.
14
Framework
● One thread per CPU
– Event-driven programming
– Everything (network & disk) is non-blocking
– How to write complex applications?
15
Framework
● Sharded (shared-nothing) applications
– Important!
16
Framework
● Language with no runtime overheads or built-in
data sharing

17
Seastar
● C++14 library
● For writing new high-performance server applications
● Share-nothing model, fully asynchronous
● Futures & Continuations based
– Unified API for all asynchronous operations
– Compose complex asynchronous operations
– The key to complex applications
● (Optionally) full zero-copy user-space TCP/IP (over DPDK)
● Open source: http://www.seastar-project.org/
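Not on the slides, but to make the model concrete: the canonical minimal Seastar program, sketched after the tutorial linked on the “More info” slide (include paths and the seastar:: namespace follow the current sources; the 2016 tree spelled these slightly differently):

    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        // run() starts one reactor (event loop) per CPU and invokes the
        // lambda on CPU 0; the program exits when the returned future
        // resolves.
        return app.run(argc, argv, [] {
            std::cout << "Hello from Seastar\n";
            return seastar::make_ready_future<>();
        });
    }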
18
Seastar linear scaling in #cores
19
Seastar linear scaling in #cores
20
Brief introduction to Seastar

21
Sharded application design
● One thread per CPU
● Each thread handles one shard of data
– No shared data (“share nothing”)
– Separate memory per CPU (NUMA aware)
– Message-passing between CPUs
– No locks or cache line bounces
● Reactor (event loop) per thread
● User-space network stack also sharded
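As a small, hedged illustration of the message-passing style (the key-to-shard mapping and the counter are invented for the example; smp::submit_to is the real Seastar call):

    #include <seastar/core/smp.hh>
    #include <seastar/core/future.hh>

    // Each shard (CPU) owns its own counter; no locks, no atomics.
    static thread_local uint64_t requests_handled = 0;

    // Instead of sharing data, send the work to the shard that owns it.
    seastar::future<> record_request(unsigned key_hash) {
        unsigned owner = key_hash % seastar::smp::count;
        return seastar::smp::submit_to(owner, [] {
            ++requests_handled;   // only ever touched by the owning shard
        });
    }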
22
Futures and continuations
● Futures and continuations are the building
blocks of asynchronous programming in
Seastar.
● Can be composed together to a large, complex,
asynchronous program.
23
Futures and continuations
● A future is a result which may not be available yet:
– Data buffer from the network
– Timer expiration
– Completion of a disk write
– The result of a computation which requires the values
from one or more other futures.
● future<int>
● future<>
24
Futures and continuations
● An asynchronous function (also “promise”) is
a function returning a future:
– future<> sleep(duration)
– future<temporary_buffer<char>> read()
● The function sets up for the future to be fulfilled
– sleep() sets a timer to fulfill the future it returns
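The Seastar tutorial (linked on the “More info” slide) illustrates this with a promise; a deliberately simplified sketch along those lines (a static promise only supports a single call, and the one-second timer is just a stand-in for a real event):

    #include <seastar/core/future.hh>
    #include <seastar/core/sleep.hh>
    #include <chrono>

    seastar::future<int> slow() {
        static seastar::promise<int> p;
        // Return the future now; fulfill it later, when the timer fires.
        // (The sleep's own future is deliberately dropped here.)
        (void)seastar::sleep(std::chrono::seconds(1)).then([] {
            p.set_value(3);
        });
        return p.get_future();
    }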

25
Futures and continuations
● A continuation is a callback, typically a lambda
executed when a future becomes ready
– sleep(1s).then([] {
    std::cerr << "done";
});
● A continuation can hold state (lambda capture)
– future<int> slow_incr(int i) {
    return sleep(10ms).then(
        [i] { return i+1; });
}
26
Futures and continuations
● Continuations can be nested:
– future<int> get();
future<> put(int);
get().then([] (int value) {
    put(value+1).then([] {
        std::cout << "done";
    });
});
● Or chained:
– get().then([] (int value) {
    return put(value+1);
}).then([] {
    std::cout << "done";
});
27
Futures and continuations
● Parallelism is easy:
– sleep(100ms).then([] {
    std::cout << "100ms\n";
});
sleep(200ms).then([] {
    std::cout << "200ms\n";
});
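When the parallel results need to be joined, Seastar provides combinators; a hedged sketch assuming when_all_succeed and the current header layout (older trees had it in core/future-util.hh):

    #include <seastar/core/future.hh>
    #include <seastar/core/sleep.hh>
    #include <seastar/core/when_all.hh>
    #include <chrono>
    #include <iostream>

    // Both sleeps start immediately and run in parallel; the continuation
    // runs only after both futures have resolved.
    seastar::future<> sleep_both() {
        using namespace std::chrono_literals;
        return seastar::when_all_succeed(seastar::sleep(100ms),
                                         seastar::sleep(200ms))
            .then([] { std::cout << "both done\n"; });
    }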
28
Futures and continuations
● In Seastar, every asynchronous operation is a
future:
– Network read or write
– Disk read or write
– Timers
– …
– A complex combination of other futures
● Useful for everything from writing network stack to
writing a full, complex, application.

29
Network zero-copy
● future<temporary_buffer>
input_stream::read()
– temporary_buffer points at driver-provided pages, if
possible.
– Automatically discarded after use (C++).
● future<> output_stream::write(temporary_buffer)
– Future becomes ready when TCP window allows further
writes (usually immediately).
– Buffer discarded after data is ACKed.
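A hedged sketch of how these streams compose in practice; only a per-connection echo loop is shown, the surrounding server setup is omitted, and the caller is assumed to keep the streams alive:

    #include <seastar/core/future.hh>
    #include <seastar/core/iostream.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <seastar/core/loop.hh>

    // Echo every buffer back to the peer until the connection is closed
    // (read() resolves to an empty buffer on EOF).
    seastar::future<> echo(seastar::input_stream<char>& in,
                           seastar::output_stream<char>& out) {
        return seastar::repeat([&in, &out] {
            return in.read().then([&out] (seastar::temporary_buffer<char> buf) {
                if (buf.empty()) {
                    return seastar::make_ready_future<seastar::stop_iteration>(
                            seastar::stop_iteration::yes);
                }
                return out.write(std::move(buf))
                    .then([&out] { return out.flush(); })
                    .then([] { return seastar::stop_iteration::no; });
            });
        });
    }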
30
Two TCP/IP implementations
Networking API
Seastar (native) Stack POSIX (hosted) stack
Linux kernel (sockets)
User-space TCP/IP
Interface layer
DPDK
Virtio Xen
igb ixgb
31
Disk I/O
● Asynchronous and zero copy, using AIO and
O_DIRECT.
● Not implemented well by all filesystems
– XFS recommended
● Focusing on SSD
● Future thought:
– Direct NVMe support,
– Implement filesystem in Seastar.
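A hedged sketch of an asynchronous read through Seastar's file API (the file name and the 4 KiB size are invented; DMA reads need aligned offsets and sizes):

    #include <seastar/core/seastar.hh>
    #include <seastar/core/file.hh>
    #include <seastar/core/future.hh>

    // Read the first 4 KiB of a file. dma_read() bypasses the page cache
    // (O_DIRECT underneath) and completes via AIO, so the reactor thread
    // never blocks.
    seastar::future<seastar::temporary_buffer<char>> read_head() {
        return seastar::open_file_dma("data.bin", seastar::open_flags::ro)
            .then([] (seastar::file f) {
                return f.dma_read<char>(0, 4096).finally([f] () mutable {
                    return f.close();
                });
            });
    }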
32
More info on Seastar
● http://seastar-project.org/
● https://github.com/scylladb/seastar
● http://docs.seastar-project.org/
● http://docs.seastar-project.org/master/md_doc_tutorial.html

33
ScyllaDB
● NoSQL database, implemented in Seastar.
● Fully compatible with Cassandra:
– Same CQL queries
– Copy over a complete Cassandra database
– Use existing drivers
– Use existing cassandra.yaml
– Use same nodetool or JMX console
– Can be clustered (of course...)
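To illustrate the “use existing drivers” point, a hedged sketch using the stock DataStax C/C++ driver pointed at a Scylla node (host address and query are just examples; error handling trimmed):

    #include <cassandra.h>
    #include <cstdio>

    int main() {
        // The unmodified Cassandra driver speaks CQL to Scylla as-is.
        CassCluster* cluster = cass_cluster_new();
        CassSession* session = cass_session_new();
        cass_cluster_set_contact_points(cluster, "127.0.0.1");  // a Scylla node

        CassFuture* connect = cass_session_connect(session, cluster);
        if (cass_future_error_code(connect) == CASS_OK) {
            CassStatement* stmt =
                cass_statement_new("SELECT release_version FROM system.local", 0);
            CassFuture* result = cass_session_execute(session, stmt);
            std::printf("query rc=%d\n", (int)cass_future_error_code(result));
            cass_future_free(result);
            cass_statement_free(stmt);
        }
        cass_future_free(connect);
        cass_session_free(session);
        cass_cluster_free(cluster);
    }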
34
[Diagram: Cassandra layers a key cache, a row cache (on-heap / off-heap), and the Linux page cache above its SSTables; ScyllaDB has a single unified cache above its SSTables.]
● Don't double-cache.
● Don't cache unrelated rows.
● Don't cache unparsed sstables.
● Can fit much more into cache.
● No page faults, threads, etc.
35
Scylla vs. Cassandra
● Single node benchmark:
– 2 x 12-core x 2 hyperthread Intel(R) Xeon(R) CPU
E5-2690 v3 @ 2.60GHz
cassandra-stress results (operations per second):

Benchmark   ScyllaDB    Cassandra
Write      1,871,556      251,785
Read       1,585,416       95,874
Mixed      1,372,451      108,947
36
Scylla vs. Cassandra
● We really got a 7x–16x speedup!
● Reads sped up more:
– Cassandra writes are simpler
– Row-cache benefits further improve Scylla's reads
● Almost 2 million writes per second on a single
machine!
– Google reported in their blogs achieving 1 million writes
per second on 330 (!) machines
– (2 years ago, and with RF=3… but still impressive).

37
Scylla vs. Cassandra
3 node cluster, 2x12 cores each; RF=3, CL=quorum
38
Better latency, at all load levels
39
What will you do with 10x performance?
● Shrink your cluster by a factor of 10
● Use stronger (but slower) data models
● Run more queries - more value from your data
● Stop using caches in front of databases
40

41
Do we qualify?
In 3 years, our small team wrote:
● A complete kernel and library (OSv).
● An asynchronous programming framework
(Seastar).
● A complete Cassandra-compatible NoSQL
database (ScyllaDB).
42
43
This project has received funding from the European Union’s
Horizon 2020 research and innovation programme under grant
agreement No 645402.

More Related Content

What's hot

Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
ScyllaDB
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
DataStax Academy
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
ScyllaDB
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
Sage Weil
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
ScyllaDB
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
DataWorks Summit
 
Ceph
CephCeph
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
ScyllaDB
 
From Postgres to ScyllaDB: Migration Strategies and Performance Gains
From Postgres to ScyllaDB: Migration Strategies and Performance GainsFrom Postgres to ScyllaDB: Migration Strategies and Performance Gains
From Postgres to ScyllaDB: Migration Strategies and Performance Gains
ScyllaDB
 
Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2
Databricks
 
Spotify: Automating Cassandra repairs
Spotify: Automating Cassandra repairsSpotify: Automating Cassandra repairs
Spotify: Automating Cassandra repairs
DataStax Academy
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
DataWorks Summit
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Cloudera, Inc.
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
DataWorks Summit
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
Ceph Community
 
How Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferHow Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and Safer
ScyllaDB
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
DataWorks Summit
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
Jose De La Rosa
 
Anil nair rac_internals_sangam_2016
Anil nair rac_internals_sangam_2016Anil nair rac_internals_sangam_2016
Anil nair rac_internals_sangam_2016
Anil Nair
 

What's hot (20)

Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Ceph
CephCeph
Ceph
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
 
From Postgres to ScyllaDB: Migration Strategies and Performance Gains
From Postgres to ScyllaDB: Migration Strategies and Performance GainsFrom Postgres to ScyllaDB: Migration Strategies and Performance Gains
From Postgres to ScyllaDB: Migration Strategies and Performance Gains
 
Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2
 
Spotify: Automating Cassandra repairs
Spotify: Automating Cassandra repairsSpotify: Automating Cassandra repairs
Spotify: Automating Cassandra repairs
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
 
How Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferHow Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and Safer
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
Anil nair rac_internals_sangam_2016
Anil nair rac_internals_sangam_2016Anil nair rac_internals_sangam_2016
Anil nair rac_internals_sangam_2016
 

Similar to Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra

Bulk Loading into Cassandra
Bulk Loading into CassandraBulk Loading into Cassandra
Bulk Loading into Cassandra
Brian Hess
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Spark Summit
 
Adventures in Thread-per-Core Async with Redpanda and Seastar
Adventures in Thread-per-Core Async with Redpanda and SeastarAdventures in Thread-per-Core Async with Redpanda and Seastar
Adventures in Thread-per-Core Async with Redpanda and Seastar
ScyllaDB
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
Stavros Kontopoulos
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
Pollfish
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit Boston
Sage Weil
 
What's new in Jewel and Beyond
What's new in Jewel and BeyondWhat's new in Jewel and Beyond
What's new in Jewel and Beyond
Sage Weil
 
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to SuccessCassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
ScyllaDB
 
Crimson: Ceph for the Age of NVMe and Persistent Memory
Crimson: Ceph for the Age of NVMe and Persistent MemoryCrimson: Ceph for the Age of NVMe and Persistent Memory
Crimson: Ceph for the Age of NVMe and Persistent Memory
ScyllaDB
 
Linux Huge Pages
Linux Huge PagesLinux Huge Pages
Linux Huge Pages
Geraldo Netto
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Sage Weil
 
Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
kanedafromparis
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
rkr10
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
DataStax Academy
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Julien Anguenot
 
Docker and coreos20141020b
Docker and coreos20141020bDocker and coreos20141020b
Docker and coreos20141020b
Richard Kuo
 
Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017
Dave Holland
 
OSv at Usenix ATC 2014
OSv at Usenix ATC 2014OSv at Usenix ATC 2014
OSv at Usenix ATC 2014
Don Marti
 

Similar to Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra (20)

Bulk Loading into Cassandra
Bulk Loading into CassandraBulk Loading into Cassandra
Bulk Loading into Cassandra
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
 
Adventures in Thread-per-Core Async with Redpanda and Seastar
Adventures in Thread-per-Core Async with Redpanda and SeastarAdventures in Thread-per-Core Async with Redpanda and Seastar
Adventures in Thread-per-Core Async with Redpanda and Seastar
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit Boston
 
What's new in Jewel and Beyond
What's new in Jewel and BeyondWhat's new in Jewel and Beyond
What's new in Jewel and Beyond
 
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to SuccessCassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
 
Crimson: Ceph for the Age of NVMe and Persistent Memory
Crimson: Ceph for the Age of NVMe and Persistent MemoryCrimson: Ceph for the Age of NVMe and Persistent Memory
Crimson: Ceph for the Age of NVMe and Persistent Memory
 
Linux Huge Pages
Linux Huge PagesLinux Huge Pages
Linux Huge Pages
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
 
Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
Docker and coreos20141020b
Docker and coreos20141020bDocker and coreos20141020b
Docker and coreos20141020b
 
Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017
 
OSv at Usenix ATC 2014
OSv at Usenix ATC 2014OSv at Usenix ATC 2014
OSv at Usenix ATC 2014
 

Recently uploaded

Net Zero Case Study: SRK House and SRK Empire
Net Zero Case Study: SRK House and SRK EmpireNet Zero Case Study: SRK House and SRK Empire
Net Zero Case Study: SRK House and SRK Empire
Global Network for Zero
 
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE DonatoCONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
Servizi a rete
 
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeBangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
bookhotbebes1
 
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
PradeepKumarSK3
 
Chlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptxChlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptx
yadavsuyash008
 
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.docCCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
Dss
 
Vernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsxVernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsx
Tool and Die Tech
 
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
YanKing2
 
Biology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtuBiology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtu
santoshpatilrao33
 
Lecture 6 - The effect of Corona effect in Power systems.pdf
Lecture 6 - The effect of Corona effect in Power systems.pdfLecture 6 - The effect of Corona effect in Power systems.pdf
Lecture 6 - The effect of Corona effect in Power systems.pdf
peacekipu
 
LeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdfLeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdf
pavanaroshni1977
 
Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.
Tool and Die Tech
 
Software Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project ManagementSoftware Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project Management
Prakhyath Rai
 
Conservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic RegenerationConservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic Regeneration
PriyankaKarn3
 
Quadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and ControlQuadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and Control
Blesson Easo Varghese
 
Exploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative ReviewExploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative Review
sipij
 
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K SchemeMSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
Anwar Patel
 
How to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POSHow to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POS
Celine George
 
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
Jim Mimlitz, P.E.
 
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeRohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
binna singh$A17
 

Recently uploaded (20)

Net Zero Case Study: SRK House and SRK Empire
Net Zero Case Study: SRK House and SRK EmpireNet Zero Case Study: SRK House and SRK Empire
Net Zero Case Study: SRK House and SRK Empire
 
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE DonatoCONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
 
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeBangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
 
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
 
Chlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptxChlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptx
 
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.docCCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
 
Vernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsxVernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsx
 
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
 
Biology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtuBiology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtu
 
Lecture 6 - The effect of Corona effect in Power systems.pdf
Lecture 6 - The effect of Corona effect in Power systems.pdfLecture 6 - The effect of Corona effect in Power systems.pdf
Lecture 6 - The effect of Corona effect in Power systems.pdf
 
LeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdfLeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdf
 
Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.
 
Software Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project ManagementSoftware Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project Management
 
Conservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic RegenerationConservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic Regeneration
 
Quadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and ControlQuadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and Control
 
Exploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative ReviewExploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative Review
 
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K SchemeMSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
 
How to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POSHow to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POS
 
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
 
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeRohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 

Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra

  • 9. 9 Bottlenecks: API locks ● In one profile, we saw 20% of the run time spent in lock() and unlock() operations, most of them uncontended – Posix APIs allow threads to share ● file descriptors ● sockets – As many as 20 lock/unlock operations for each network packet! ● Uncontended locks were efficient on UP (a flag to disable preemption), but atomic operations are slow on many cores.
  • 10. 10 Bottlenecks: API copies ● The write/send system calls copy user data into the kernel – Even on OSv, with no user/kernel separation – It is part of the socket API ● Similarly for read
  • 11. 11 Bottlenecks: context switching ● One thread per CPU is optimal; more than one requires: – Context switch time – Stacks that consume memory and pollute the CPU cache – Thread imbalance ● Requires fully non-blocking APIs – Cassandra uses mmap() for disk I/O….
  • 12. 12 Bottlenecks: unscalable applications ● Contended locks ruin scalability to many cores – Memcache's counter and shared cache ● Solution: per-cpu data. ● Even lock-free atomic algorithms are unscalable – Cache line bouncing ● Again, better to shard, not share, data. – Becomes worse as core count grows ● NUMA
  • 13. 13 Therefore ● Need to provide better APIs for server applications – Not file descriptors, sockets, threads, etc. ● Need to write better applications.
  • 14. 14 Framework ● One thread per CPU – Event-driven programming – Everything (network & disk) is non-blocking – How to write complex applications?
  • 15. 15 Framework ● Sharded (shared-nothing) applications – Important!
  • 16. 16 Framework ● Language with no runtime overheads or built-in data sharing
  • 17. 17 Seastar ● C++14 library ● For writing new high-performance server applications ● Share-nothing model, fully asynchronous ● Based on futures & continuations – A unified API for all asynchronous operations – Compose complex asynchronous operations – The key to complex applications ● (Optionally) full zero-copy user-space TCP/IP (over DPDK) ● Open source: http://www.seastar-project.org/
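To connect the bullet points above to code, here is a minimal sketch of what a Seastar program looks like. It only uses the public app_template API; the header paths, the seastar:: namespace, and make_ready_future() are taken from current Seastar releases rather than from the slides, so treat the exact spellings as assumptions (the 2016-era headers and namespaces differed).

    // Minimal Seastar "hello world" (sketch; header paths and namespace
    // are assumptions based on current Seastar releases).
    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        // app.run() parses Seastar's command-line options, starts one
        // reactor (event loop) per core, and runs the supplied lambda on
        // core 0; the program exits when the returned future resolves.
        return app.run(argc, argv, [] {
            std::cout << "Hello from the Seastar reactor\n";
            return seastar::make_ready_future<>();
        });
    }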
  • 21. 21 Sharded application design ● One thread per CPU ● Each thread handles one shard of data – No shared data (“share nothing”) – Separate memory per CPU (NUMA aware) – Message-passing between CPUs – No locks or cache line bounces ● Reactor (event loop) per thread ● User-space network stack also sharded
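To make the "message-passing between CPUs" point concrete, here is a small sketch of how one shard updates state owned by another shard without locks. It is built around seastar::smp::submit_to(); this_shard_id(), the header paths and the thread_local-per-shard pattern are assumptions drawn from today's Seastar API rather than from the slides.

    // Sketch of cross-shard messaging (names/headers assumed from current
    // Seastar; the structure is the point, not the exact spelling).
    #include <seastar/core/smp.hh>
    #include <seastar/core/future.hh>
    #include <cstdint>

    // Each shard owns its private counter; no shared memory, no locks.
    thread_local uint64_t local_hits = 0;

    // To touch state owned by another shard, send it a message (a lambda)
    // instead of writing to its memory.
    seastar::future<> record_hit(unsigned owner_shard) {
        if (owner_shard == seastar::this_shard_id()) {
            ++local_hits;                     // fast path: we own the data
            return seastar::make_ready_future<>();
        }
        return seastar::smp::submit_to(owner_shard, [] {
            ++local_hits;                     // runs on the owning shard
        });
    }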
  • 22. 22 Futures and continuations ● Futures and continuations are the building blocks of asynchronous programming in Seastar. ● Can be composed together to a large, complex, asynchronous program.
  • 23. 23 Futures and continuations ● A future is a result which may not be available yet: – Data buffer from the network – Timer expiration – Completion of a disk write – The result of a computation which requires the values from one or more other futures. ● future<int> ● future<>
  • 24. 24 Futures and continuations ● An asynchronous function (also “promise”) is a function returning a future: – future<> sleep(duration) – future<temporary_buffer<char>> read() ● The function sets up for the future to be fulfilled – sleep() sets a timer to fulfill the future it returns
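As an illustration of the promise/future pairing described above, the hypothetical slow_answer() below shows the producing side: the promise is held by the code that will eventually produce the value, and set_value() is what makes the returned future ready. The names promise, get_future() and set_value() match Seastar's public API as far as I know; the helper itself, and fulfilling the promise immediately instead of from a real event handler, are purely for illustration.

    // Sketch of an asynchronous function built from a promise (illustrative
    // only; real code would fulfill the promise from an event handler).
    #include <seastar/core/future.hh>
    #include <memory>

    seastar::future<int> slow_answer() {
        // The promise is the producing side of the future.
        auto p = std::make_unique<seastar::promise<int>>();
        auto fut = p->get_future();
        // Arrange for the promise to be fulfilled later; here we pretend
        // some asynchronous event invokes this completion callback.
        auto complete = [p = std::move(p)]() mutable {
            p->set_value(42);  // makes the future ready, runs continuations
        };
        complete();            // immediate here only for illustration
        return fut;
    }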
  • 25. 25 Futures and continuations ● A continuation is a callback, typically a lambda, executed when a future becomes ready – sleep(1s).then([] { std::cerr << "done"; }); ● A continuation can hold state (lambda capture) – future<int> slow_incr(int i) { return sleep(10ms).then([i] { return i + 1; }); }
  • 26. 26 Futures and continuations ● Continuations can be nested: – future<int> get(); future<> put(int); get().then([] (int value) { put(value+1).then([] { std::cout << "done"; }); }); ● Or chained: – get().then([] (int value) { return put(value+1); }).then([] { std::cout << "done"; });
  • 27. 27 Futures and continuations ● Parallelism is easy: – sleep(100ms).then([] { std::cout << "100ms\n"; }); sleep(200ms).then([] { std::cout << "200ms\n"; });
  • 28. 28 Futures and continuations ● In Seastar, every asynchronous operation is a future: – Network read or write – Disk read or write – Timers – … – A complex combination of other futures ● Useful for everything from writing network stack to writing a full, complex, application.
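A small sketch of such a combination: two independent timers composed into one future. when_all() and discard_result() are assumed from current Seastar releases (older versions kept these utilities in a different header), so the exact spelling may differ from the 2016 API; the structure is the point — the combined operation is itself just another future.

    // Composing independent futures into one (sketch; helper names and
    // headers assumed from current Seastar releases).
    #include <seastar/core/sleep.hh>
    #include <seastar/core/when_all.hh>
    #include <chrono>

    using namespace std::chrono_literals;

    seastar::future<> wait_for_both() {
        // Both timers run concurrently on the same reactor; the combined
        // future resolves only when every component future has resolved.
        return seastar::when_all(
            seastar::sleep(100ms),
            seastar::sleep(200ms)
        ).discard_result().then([] {
            // Reached after ~200ms, not 300ms: the waits overlapped.
        });
    }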
  • 29. 29 Network zero-copy ● future<temporary_buffer> input_stream::read() – temporary_buffer points at driver-provided pages, if possible. – Automatically discarded after use (C++). ● future<> output_stream::write(temporary_buffer) – Future becomes ready when the TCP window allows further writes (usually immediately). – Buffer discarded after the data is ACKed.
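Putting the two calls above together, a hedged sketch of a zero-copy forwarding loop (the core of an echo server, say) might look like the following. input_stream::read() and output_stream::write() are the calls named on the slide; repeat(), stop_iteration, flush(), close() and the header paths are assumptions based on current Seastar and may have differed in 2016. Error handling and stream lifetime are left to the caller.

    // Zero-copy echo loop sketch: buffers from read() are handed to
    // write() without copying the payload.
    #include <seastar/core/iostream.hh>
    #include <seastar/core/loop.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <utility>

    seastar::future<> echo(seastar::input_stream<char>& in,
                           seastar::output_stream<char>& out) {
        return seastar::repeat([&in, &out] {
            return in.read().then([&out] (seastar::temporary_buffer<char> buf) {
                if (buf.empty()) {       // an empty buffer signals end of stream
                    return seastar::make_ready_future<seastar::stop_iteration>(
                        seastar::stop_iteration::yes);
                }
                // The buffer still points at the pages the driver produced;
                // write() passes it onward without copying the payload.
                return out.write(std::move(buf)).then([&out] {
                    return out.flush();
                }).then([] {
                    return seastar::stop_iteration::no;
                });
            });
        }).then([&out] {
            return out.close();
        });
    }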
  • 30. 30 Two TCP/IP implementations, behind one networking API: – Seastar (native) stack: user-space TCP/IP on top of an interface layer (DPDK, virtio, Xen, igb, ixgb). – POSIX (hosted) stack: sockets provided by the Linux kernel.
  • 31. 31 Disk I/O ● Asynchronous and zero copy, using AIO and O_DIRECT. ● Not implemented well by all filesystems – XFS recommended ● Focusing on SSD ● Future thought: – Direct NVMe support, – Implement filesystem in Seastar.
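For completeness, a heavily hedged sketch of what asynchronous, O_DIRECT-style file access looks like through Seastar. open_file_dma(), open_flags::ro and file::dma_read() are taken from current Seastar documentation, not from the slide, so the exact signatures and header locations are assumptions; the shape, however, matches the slide: every disk operation returns a future, and the data arrives in an aligned buffer that bypassed the page cache.

    // Asynchronous, direct-I/O read sketch (API names assumed from current
    // Seastar; the 2016-era signatures may have differed).
    #include <seastar/core/seastar.hh>
    #include <seastar/core/file.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <seastar/core/sstring.hh>

    seastar::future<> read_first_block(seastar::sstring path) {
        return seastar::open_file_dma(path, seastar::open_flags::ro)
            .then([] (seastar::file f) {
                // DMA reads must be aligned to the device block size;
                // 4096 bytes is commonly a safe granule for SSDs.
                return f.dma_read<char>(0, 4096).then(
                    [f] (seastar::temporary_buffer<char> buf) mutable {
                        // buf holds the first 4 KiB, read with O_DIRECT
                        // semantics (no page cache involved).
                        return f.close();
                    });
            });
    }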
  • 32. 32 More info on Seastar ● http://seastar-project.com ● https://github.com/scylladb/seastar ● http://docs.seastar-project.org/ ● http://docs.seastar-project.org/master/md_doc_tutorial.html
  • 33. 33 ScyllaDB ● NoSQL database, implemented in Seastar. ● Fully compatible with Cassandra: – Same CQL queries – Copy over a complete Cassandra database – Use existing drivers – Use existing cassandra.yaml – Use same nodetool or JMX console – Can be clustered (of course...)
  • 34. 34 Cassandra vs. ScyllaDB caching – Cassandra: key cache + row cache (on-heap / off-heap) + Linux page cache, all in front of the SSTables. – ScyllaDB: a single unified cache in front of the SSTables. ● Don't double-cache. ● Don't cache unrelated rows. ● Don't cache unparsed sstables. ● Can fit much more into cache. ● No page faults, threads, etc.
  • 35. 35 Scylla vs. Cassandra ● Single node benchmark: – 2 x 12-core x 2-hyperthread Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz ● cassandra-stress results (operations per second): – Write: ScyllaDB 1,871,556 vs. Cassandra 251,785 – Read: ScyllaDB 1,585,416 vs. Cassandra 95,874 – Mixed: ScyllaDB 1,372,451 vs. Cassandra 108,947
  • 36. 36 Scylla vs. Cassandra ● We really got a 7x – 16x speedup! ● Reads sped up more – – Cassandra writes are simpler – Row-cache benefits further improve Scylla's reads ● Almost 2 million writes per second on a single machine! – Google reported in their blogs achieving 1 million writes per second on 330 (!) machines – (2 years ago, and RF=3… but still impressive).
  • 37. 37 Scylla vs. Cassandra 3 node cluster, 2x12 cores each; RF=3, CL=quorum
  • 38. 38 Better latency, at all load levels
  • 39. 39 What will you do with 10x performance? ● Shrink your cluster by a factor of 10 ● Use stronger (but slower) data models ● Run more queries - more value from your data ● Stop using caches in front of databases
  • 41. 41 Do we qualify? In 3 years, our small team wrote: ● A complete kernel and library (OSv). ● An asynchronous programming framework (Seastar). ● A complete Cassandra-compatible NoSQL database (ScyllaDB).
  • 43. 43 This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 645402.