Rethinking Topology in Cassandra


                            ApacheCon Europe
                            November 7, 2012



                                Eric Evans
                            eevans@acunu.com
                               @jericevans


DHT 101



DHT 101
                                    partitioning
[Diagram: the key space drawn as a ring, wrapping from Z back around to A]

The keyspace, a namespace encompassing all possible keys
DHT 101
                                      partitioning



[Diagram: the ring divided among nodes A, B, C, …, Y, Z]

The namespace is divided into N partitions (where N is the number of nodes). Partitions are
mapped to nodes and placed evenly throughout the namespace.
DHT 101
                                      partitioning



[Diagram: ring of nodes A, B, C, …, Y, Z; Key = Aaa falls just past A and is stored on the next node clockwise]

A record, stored by key, is positioned on the next node (working clockwise) from where it
sorts in the namespace
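To make the clockwise rule concrete, here is a minimal sketch in Python (not Cassandra code; the ring positions are illustrative): the sorted node positions form the ring, and a key is stored on the first node at or past where it sorts, wrapping around past the end.

      import bisect

      ring = ["A", "B", "C", "Y", "Z"]   # node positions, sorted around the ring

      def primary_node(key: str) -> str:
          # first node at or past where the key sorts, wrapping past Z back to A
          i = bisect.bisect_left(ring, key) % len(ring)
          return ring[i]

      print(primary_node("Aaa"))  # sorts just after "A", so it lands on "B"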
DHT 101
                                    replica placement



[Diagram: same ring; Key = Aaa and the nodes holding its replicas]

Additional copies (replicas) are stored on other nodes. Commonly the next N-1 nodes, but
anything deterministic will work.
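The same toy ring extended with the replica rule from this slide, again as an illustrative Python sketch rather than Cassandra's implementation: copies go to the next nodes working clockwise from the primary.

      import bisect

      ring = ["A", "B", "C", "Y", "Z"]

      def replica_nodes(key: str, replication_factor: int = 3) -> list[str]:
          # primary plus the next replication_factor - 1 nodes clockwise
          i = bisect.bisect_left(ring, key)
          return [ring[(i + n) % len(ring)] for n in range(replication_factor)]

      print(replica_nodes("Aaa"))  # ['B', 'C', 'Y'] on this toy ring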
DHT 101
                                     consistency




                      Consistency
                      Availability
                      Partition tolerance



With multiple copies comes a set of trade-offs commonly articulated using the CAP theorem;
At any given point, we can only guarantee 2 of Consistency, Availability, and Partition
tolerance.
DHT 101
                            scenario: consistency level = one


[Diagram: write (W) at consistency ONE; replica A acknowledges, the other two replicas (?) may be unavailable]

Writing at consistency level ONE provides very high availability: only one of the 3 replica nodes needs to be up for the write to succeed
DHT 101
                            scenario: consistency level = all


[Diagram: read (R) at consistency ALL across replica A and the two other replicas (?)]

If strong consistency is required, reads at consistency ALL can be paired with writes performed at ONE. The trade-off is availability: all 3 replica nodes must be up, or the read fails.
DHT 101
                            scenario: quorum write


[Diagram: quorum write (W) acknowledged by A and B; the third replica (?) may be down; R + W > N]

Using QUORUM consistency, we only require floor((N/2)+1) nodes.
DHT 101
                             scenario: quorum read


[Diagram: quorum read (R) answered by B and C; the third replica (?) may be down; R + W > N]

Using QUORUM consistency, we only require floor((N/2)+1) nodes.
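The consistency-level arithmetic behind these scenarios, as a small sketch (ONE, QUORUM, and ALL are modeled only as acknowledgement counts; this is not Cassandra's API):

      def required_acks(level: str, n_replicas: int) -> int:
          return {"ONE": 1,
                  "QUORUM": n_replicas // 2 + 1,   # floor(N/2) + 1
                  "ALL": n_replicas}[level]

      N = 3
      R = required_acks("QUORUM", N)   # 2
      W = required_acks("QUORUM", N)   # 2
      # R + W > N means every read quorum overlaps every write quorum,
      # so at least one replica in the read set has seen the latest write.
      print(R, W, R + W > N)           # 2 2 True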
Awesome, yes?




Well...




Problem:
                            Poor load distribution




Distributing Load

[Diagram: ring of nodes A, B, C, …, M, …, Y, Z]

B and C hold replicas of A
Distributing Load

[Diagram: same ring]

A and B hold replicas of Z
Distributing Load

[Diagram: same ring]

Z and A hold replicas of Y
Distributing Load

[Diagram: same ring]

Disaster strikes!
Distributing Load

[Diagram: same ring, with node A down]

Sets [Y,Z,A], [Z,A,B], [A,B,C] all suffer the loss of A; Results in extra load on neighboring
nodes
Distributing Load

[Diagram: same ring, with replacement node A1 brought in at A's position]

Solution: Replace/repair down node
Distributing Load

[Diagram: same as previous slide]

Solution: Replace/repair down node
Distributing Load

[Diagram: same ring; neighbors stream data to the replacement node]

Neighboring nodes are needed to stream missing data to A; Results in even more load on
neighboring nodes
Problem:
                            Poor data distribution




Distributing Data
[Diagram: ring with four nodes A, B, C, D evenly spaced]

Ideal distribution of keyspace
Distributing Data
[Diagram: same ring with a fifth node E bootstrapped in, bisecting one partition]

Bootstrapping a node, bisecting one partition; Distribution is no longer ideal
Distributing Data
[Diagram: existing nodes moved to new positions to restore even spacing]

Moving existing nodes means moving corresponding data; Not ideal
Distributing Data
[Diagram: same as previous slide]

Moving existing nodes means moving corresponding data; Not ideal
Distributing Data
[Diagram: cluster doubled to eight nodes A through H, bisecting every range]

Frequently cited alternative: Double the size of your cluster, bisecting all ranges
Distributing Data
[Diagram: same as previous slide]

Frequently cited alternative: Double the size of your cluster, bisecting all ranges
Virtual Nodes



In a nutshell...

[Diagram: a ring of many virtual nodes, groups of which map to each of three physical hosts]

Basically: “nodes” on the ring are virtual, and many of them are mapped to each “real” node
(host)
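A toy illustration of that mapping (made-up host names and a small token count, not Cassandra internals; the deck's config example uses num_tokens: 256): each physical host owns many small, randomly scattered token ranges instead of a single contiguous arc.

      import random

      hosts = ["host1", "host2", "host3"]
      tokens_per_host = 8

      random.seed(42)
      ring = sorted((random.randrange(2**64), host)
                    for host in hosts
                    for _ in range(tokens_per_host))

      for token, host in ring[:6]:
          print(f"{token:>20}  ->  {host}")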
Benefits
                   • Operationally simpler (no token management)
                   • Better distribution of load
                   • Concurrent streaming involving all hosts
                   • Smaller partitions mean greater reliability
                   • Supports heterogeneous hardware


Strategies

                   • Automatic sharding
                   • Fixed partition assignment
                   • Random token assignment



Strategy
                                        Automatic Sharding



                   • Partitions are split when data exceeds a threshold
                   • Newly created partitions are relocated to a host with lower data load
                   • Similar to sharding performed by Bigtable, or Mongo auto-sharding
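A rough sketch of this split-and-relocate behaviour, with invented data structures (nothing here is Bigtable or Mongo code): when a partition's size crosses a threshold it is split at its midpoint and the new half is handed to the least-loaded host.

      THRESHOLD = 100

      def maybe_split(partition, partitions, host_load):
          if partition["size"] <= THRESHOLD:
              return
          mid = (partition["lo"] + partition["hi"]) // 2
          new = {"lo": mid, "hi": partition["hi"],
                 "size": partition["size"] // 2,
                 "host": min(host_load, key=host_load.get)}  # least-loaded host
          partition["hi"] = mid
          partition["size"] -= new["size"]
          host_load[partition["host"]] -= new["size"]   # half the data relocates
          host_load[new["host"]] += new["size"]
          partitions.append(new)

      partitions = [{"lo": 0, "hi": 1000, "size": 150, "host": "host1"}]
      host_load = {"host1": 150, "host2": 0}
      maybe_split(partitions[0], partitions, host_load)
      print(partitions, host_load)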



Strategy
                                     Fixed Partition Assignment

                   • Namespace divided into Q evenly-sized partitions
                   • Q/N partitions assigned per host (where N is the number of hosts)
                   • Joining hosts “steal” partitions evenly from existing hosts.
                   • Used by Dynamo and Voldemort (described in Dynamo paper as “strategy 3”)
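A sketch of the Q/N bookkeeping under this strategy (hypothetical helper, not Dynamo or Voldemort code): with Q fixed partitions, a joining host steals partitions from hosts holding more than the new Q/N target.

      Q = 12  # fixed number of partitions, chosen up front

      def rebalance(assignment: dict[int, str], new_host: str) -> None:
          hosts = sorted(set(assignment.values())) + [new_host]
          target = Q // len(hosts)                      # Q/N partitions per host
          counts = {h: list(assignment.values()).count(h) for h in hosts[:-1]}
          for p in sorted(assignment):
              owner = assignment[p]
              if counts.get(owner, 0) > target:
                  counts[owner] -= 1
                  assignment[p] = new_host              # "steal" this partition

      assignment = {p: f"host{p % 3 + 1}" for p in range(Q)}  # 3 hosts, 4 partitions each
      rebalance(assignment, "host4")
      print(sorted(assignment.items()))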


Strategy
                                    Random Token Assignment



                   • Each host assigned T random tokens
                   • T random tokens generated for joining hosts; New tokens divide existing ranges
                   • Similar to libketama; Identical to Classic Cassandra when T=1
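And a sketch of what joining looks like under random token assignment (toy token space, illustrative only): the joining host draws T random tokens, each of which bisects whatever existing range it lands in; with T=1 this reduces to the classic single-token behaviour.

      import bisect, random

      random.seed(7)
      ring = sorted(random.sample(range(1000), 12))     # existing tokens on the ring
      T = 4
      new_tokens = sorted(random.sample(range(1000), T))

      for t in new_tokens:
          i = bisect.bisect_left(ring, t)
          lo, hi = ring[i - 1], ring[i % len(ring)]     # the range the new token divides
          print(f"token {t} splits range ({lo}, {hi}]")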



Considerations

                   1. Number of partitions
                   2. Partition size
                   3. How 1 changes with more nodes and data
                   4. How 2 changes with more nodes and data




Evaluating
                   Strategy         No. Partitions   Partition size
                   Random           O(N)             O(B/N)
                   Fixed            O(1)             O(B)
                   Auto-sharding    O(B)             O(1)

                   B ~ total data size, N ~ number of hosts


Evaluating
                   • Automatic sharding
                     • partition size constant (great)
                      • number of partitions scales linearly with data size (bad)
                   • Fixed partition assignment
                   • Random token assignment

Evaluating
                   •        Automatic sharding
                   •        Fixed partition assignment
                        •     Number of partitions is constant (good)
                        •     Partition size scales linearly with data size (bad)
                        •     Higher operational complexity (bad)
                   •        Random token assignment


Evaluating
                   • Automatic sharding
                   • Fixed partition assignment
                   • Random token assignment
                      • Number of partitions scales linearly with number of hosts (ok)
                      • Partition size increases with more data; decreases with more hosts (good)


Evaluating


                   • Automatic sharding
                   • Fixed partition assignment
                   • Random token assignment



Cassandra



Configuration
                               conf/cassandra.yaml


               # Comma separated list of tokens,
               # (new installs only).
               initial_token: <token>,<token>,<token>

               or

               # Number of tokens to generate.
               num_tokens: 256




Two params control how tokens are assigned. The initial_token param now optionally
accepts a csv list, or (preferably) you can assign a numeric value to num_tokens
Configuration
                                     nodetool info

      Token           :     (invoke with -T/--tokens to see all 256 tokens)
      ID              :     64090651-6034-41d5-bfc6-ddd24957f164
      Gossip active   :     true
      Thrift active   :     true
      Load            :     92.69 KB
      Generation No   :     1351030018
      Uptime (seconds):     45
      Heap Memory (MB):     95.16 / 1956.00
      Data Center     :     datacenter1
      Rack            :     rack1
      Exceptions      :     0
      Key Cache       :     size 240 (bytes), capacity 101711872 (bytes ...
      Row Cache       :     size 0 (bytes), capacity 0 (bytes), 0 hits, ...





To keep the output readable, nodetool info no longer displays tokens (if there is more than one) unless the -T/--tokens argument is passed
Configuration
                                              nodetool ring
      Datacenter: datacenter1
      ==========
      Replicas: 2

      Address               Rack    Status State    Load         Owns     Token
                                                                          9022770486425350384
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -9182469192098976078
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -9054823614314102214
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8970752544645156769
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8927190060345427739
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8880475677109843259
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8817876497520861779
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8810512134942064901
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8661764562509480261
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8641550925069186492
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8636224350654790732
      ...
      ...





nodetool ring is still there, but the output is significantly more verbose, and it is less useful as the go-to command
Configuration
                                  nodetool status




      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      -- Address   Load    Tokens Owns   Host ID                               Rack
      UN 10.0.0.1 97.2 KB 256     66.0% 64090651-6034-41d5-bfc6-ddd24957f164   rack1
      UN 10.0.0.2 92.7 KB 256     66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c   rack1
      UN 10.0.0.3 92.6 KB 256     67.7% e4eef159-cb77-4627-84c4-14efbc868082   rack1





New go-to command is nodetool status
Configuration
                                     nodetool status




      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      -- Address   Load    Tokens Owns   Host ID                                       Rack
      UN 10.0.0.1 97.2 KB 256     66.0% 64090651-6034-41d5-bfc6-ddd24957f164           rack1
      UN 10.0.0.2 92.7 KB 256     66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c           rack1
      UN 10.0.0.3 92.6 KB 256     67.7% e4eef159-cb77-4627-84c4-14efbc868082           rack1





Of note, since it is no longer practical to name a host by its token (because it can have many), each host has a unique ID
Configuration
                                  nodetool status




      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      -- Address   Load    Tokens Owns   Host ID                               Rack
      UN 10.0.0.1 97.2 KB 256     66.0% 64090651-6034-41d5-bfc6-ddd24957f164   rack1
      UN 10.0.0.2 92.7 KB 256     66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c   rack1
      UN 10.0.0.3 92.6 KB 256     67.7% e4eef159-cb77-4627-84c4-14efbc868082   rack1





Note the per-node token count
Migration
[Diagram: a three-node cluster, A, B, and C, before migration]
Migration
                            edit conf/cassandra.yaml and restart




               # Number of tokens to generate.
               num_tokens: 256





Step 1: Set num_tokens in cassandra.yaml, and restart node
Migration
                       convert to T contiguous tokens in existing ranges

[Diagram: each node's single range now split into many contiguous tokens, still owned by the same node]

This will cause the existing range to be split into T contiguous tokens. This results in no
change to placement
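A sketch of that conversion with toy numbers (Cassandra works on its real token space; this just shows the arithmetic): the single range a node already owns is cut into T contiguous, evenly sized sub-ranges, so every key stays on the node it was on before.

      def split_range(prev_token: int, my_token: int, T: int) -> list[int]:
          # T - 1 evenly spaced interior tokens, plus the node's original token
          width = (my_token - prev_token) // T
          return [prev_token + width * i for i in range(1, T)] + [my_token]

      # A node that owned (0, 1200] with a single token now owns the same
      # span as T = 4 contiguous tokens: 300, 600, 900, 1200.
      print(split_range(0, 1200, 4))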
Migration
                                         shuffle

[Diagram: the same tokens after shuffling, randomly redistributed among nodes A, B, and C]

Step 2: Initialize a shuffle operation. Nodes randomly exchange ranges.
Shuffle

                   • Range transfers are queued on each host
                   • Hosts initiate transfer of ranges to self
                   • Pay attention to the logs!


Shuffle
                                          bin/shuffle
      Usage: shuffle [options] <sub-command>

      Sub-commands:
       create               Initialize a new shuffle operation
       ls                   List pending relocations
       clear                Clear pending relocations
       en[able]             Enable shuffling
       dis[able]            Disable shuffling

      Options:
       -dc, --only-dc               Apply only to named DC (create only)
       -tp, --thrift-port           Thrift port number (Default: 9160)
       -p,   --port                 JMX port number (Default: 7199)
       -tf, --thrift-framed         Enable framed transport for Thrift (Default: false)
       -en, --and-enable            Immediately enable shuffling (create only)
       -H,   --help                 Print help information
       -h,   --host                 JMX hostname or IP address (Default: localhost)
       -th, --thrift-host           Thrift hostname or IP address (Default: JMX host)




Performance



removenode
[Bar chart: removenode, Acunu Reflex / Cassandra 1.2 vs Cassandra 1.1 (scale 0 to 400)]

17 node cluster of EC2 m1.large instances, 460M rows
bootstrap
[Bar chart: bootstrap, Acunu Reflex / Cassandra 1.2 vs Cassandra 1.1 (scale 0 to 500)]

17 node cluster of EC2 m1.large instances, 460M rows
The End
         • Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan
             Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan
             Sivasubramanian, Peter Vosshall and Werner Vogels “Dynamo: Amazon’s
             Highly Available Key-value Store” Web.

         • Low, Richard. “Improving Cassandra's uptime with virtual nodes” Web.
         • Overton, Sam. “Virtual Nodes Strategies.” Web.
         • Overton, Sam. “Virtual Nodes: Performance Results.” Web.
         • Jones, Richard. "libketama - a consistent hashing algo for memcache
             clients” Web.


