Rethinking Topology in Cassandra

                            ApacheCon Europe
                            November 7, 2012

                                Eric Evans

Wednesday, November 7, 12                      1
DHT 101

DHT 101
                                        Z    A


The keyspace, a namespace encompassing all possible keys
DHT 101

                                Z                          A

                            Y                                    B



The namespace is divided into N partitions (where N is the number of nodes). Partitions are
mapped to nodes and placed evenly throughout the namespace.
DHT 101

                                Z                           A

                            Y         Key = Aaa                   B



A record, stored by key, is positioned on the next node (working clockwise) from where it
sorts in the namespace
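
The clockwise lookup can be sketched in a few lines of Python. This is only an illustration: the five-node ring and its single-letter tokens are hypothetical, and `owner` is a made-up helper, not Cassandra's API.

```python
from bisect import bisect_left

# Hypothetical ring: each node is named by its token, and owns the range
# that ends at that token. Letters are used so tokens sort like the keys.
RING = sorted(["A", "B", "C", "Y", "Z"])

def owner(key, ring=RING):
    """Return the first node at or after the key, wrapping past the end."""
    i = bisect_left(ring, key)
    return ring[i % len(ring)]

print(owner("Aaa"))  # B
print(owner("Zzz"))  # A (wraps around the ring)
```

The key "Aaa" sorts between A and B, so working clockwise it lands on B.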
DHT 101
                                    replica placement

                                Z                         A

                            Y         Key = Aaa                 B



Additional copies (replicas) are stored on other nodes. Commonly the next N-1 nodes, but
anything deterministic will work.
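
A minimal sketch of the "next N-1 nodes" placement, assuming a hypothetical five-node ring and a replication factor passed as `rf` (the slide's N); the function name is illustrative only.

```python
from bisect import bisect_left

def replicas(key, ring, rf):
    """Return rf distinct nodes: the owner, then the next rf-1 clockwise."""
    ring = sorted(ring)
    start = bisect_left(ring, key)
    count = min(rf, len(ring))  # cannot place more replicas than nodes
    return [ring[(start + k) % len(ring)] for k in range(count)]

nodes = ["A", "B", "C", "Y", "Z"]  # hypothetical ring
print(replicas("Aaa", nodes, 3))   # ['B', 'C', 'Y']
print(replicas("Zzz", nodes, 3))   # ['A', 'B', 'C']
```

Any other rule would do as long as every node computes the same answer for the same key.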
DHT 101

                      Partition tolerance


With multiple copies comes a set of trade-offs commonly articulated using the CAP theorem:
at any given point, we can only guarantee 2 of Consistency, Availability, and Partition
tolerance.
DHT 101
                            scenario: consistency level = one





Writing at consistency level ONE provides very high availability; only one of the 3 member
nodes needs to be up for a write to succeed.
DHT 101
                            scenario: consistency level = all





If strong consistency is required, reads at consistency level ALL can be paired with writes
performed at ONE. The trade-off is in availability: all 3 member nodes must be up, or else
the read fails.
DHT 101
                            scenario: quorum write


                  R+W > N           B



Using QUORUM consistency, we only require floor((N/2)+1) nodes.
DHT 101
                             scenario: quorum read


                  R+W > N           B



Using QUORUM consistency, we only require floor((N/2)+1) nodes.
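
The quorum arithmetic and the R+W > N rule from the slide can be checked with a short sketch. The helper names are made up for illustration; N is the replication factor.

```python
def quorum(n):
    """Replicas that must respond at QUORUM: floor(n / 2) + 1."""
    return n // 2 + 1

def overlaps(r, w, n):
    """True when every read set must intersect every write set (R + W > N)."""
    return r + w > n

n = 3
q = quorum(n)
print(q)                  # 2
print(overlaps(q, q, n))  # True: QUORUM reads see QUORUM writes
print(overlaps(1, 1, n))  # False: ONE/ONE may return stale data
print(overlaps(n, 1, n))  # True: read ALL, write ONE
```

With N = 3, a quorum tolerates one node being down for both reads and writes.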
Awesome, yes?

                            Poor load distribution

Distributing Load

                                 Z       A

                             Y               B



B and C hold replicas of A
Distributing Load

                                 Z       A

                             Y               B



A and B hold replicas of Z
Distributing Load

                                 Z       A

                             Y               B



Z and A hold replicas of Y
Distributing Load

                                 Z       A

                             Y               B



Disaster strikes!
Distributing Load

                                 Z                              A

                             Y                                        B



Sets [Y,Z,A], [Z,A,B], [A,B,C] all suffer the loss of A, which results in extra load on
neighboring nodes
Distributing Load

                                 Z       A

                             Y                B



Solution: Replace/repair down node
Distributing Load

                                 Z                        A

                             Y                                 B



Neighboring nodes are needed to stream missing data to A; Results in even more load on
neighboring nodes
                            Poor data distribution

Distributing Data




Ideal distribution of keyspace
Distributing Data




Bootstrapping a node, bisecting one partition; Distribution is no longer ideal
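
To see why bisecting one partition spoils the balance, here is a small sketch that computes each token's share of a hypothetical 120-unit ring before and after a fifth node joins. The ring size, tokens, and function name are all assumptions for illustration.

```python
def ownership(tokens, ring_size):
    """Map each token to the fraction of the ring it owns (the range ending at it)."""
    tokens = sorted(tokens)
    shares = {}
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1]  # i == 0 wraps around to the last token
        # A lone token owns the whole ring, hence the "or ring_size".
        shares[tok] = ((tok - prev) % ring_size or ring_size) / ring_size
    return shares

RING_SIZE = 120
before = ownership([0, 30, 60, 90], RING_SIZE)
after = ownership([0, 30, 45, 60, 90], RING_SIZE)
print(before)  # every token owns 0.25
print(after)   # tokens 45 and 60 own 0.125 each; the others still own 0.25
```

The new node relieves only one neighbor; the remaining three nodes carry exactly as much as before.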
Distributing Data
                                          A     A


                                          B B


Moving existing nodes means moving corresponding data; Not ideal
Distributing Data
                                 H                             E


                                 G                             F


Frequently cited alternative: Double the size of your cluster, bisecting all ranges
Virtual Nodes

In a nutshell...





Basically: “nodes” on the ring are virtual, and many of them are mapped to each “real” node
                   • Operationally simpler (no token
                   • Better distribution of load
                   • Concurrent streaming involving all hosts
                   • Smaller partitions mean greater reliability
                   • Supports heterogeneous hardware


                   • Automatic sharding
                   • Fixed partition assignment
                   • Random token assignment

                                        Automatic Sharding

                   • Partitions are split when data exceeds a
                   • Newly created partitions are relocated to a
                            host with lower data load
                   • Similar to sharding performed by Bigtable,
                            or Mongo auto-sharding

                                     Fixed Partition Assignment

                   • Namespace divided into Q evenly-sized partitions
                   • Q/N partitions assigned per host (where N
                            is the number of hosts)
                   • Joining hosts “steal” partitions evenly from
                            existing hosts.
                   • Used by Dynamo and Voldemort (described
                            in Dynamo paper as “strategy 3”)
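
A rough sketch of how a joining host might "steal" partitions evenly under fixed partition assignment. The host names, Q = 12, and the donor-selection rule (always take from the currently fullest host) are assumptions made for this example, not a description of Dynamo's or Voldemort's actual implementation.

```python
def join(assignment, new_host):
    """Give new_host an even share by taking partitions from the fullest hosts."""
    assignment = {h: list(p) for h, p in assignment.items()}  # do not mutate input
    total = sum(len(p) for p in assignment.values())
    target = total // (len(assignment) + 1)  # roughly Q / N after the join
    stolen = []
    while len(stolen) < target:
        donor = max(assignment, key=lambda h: len(assignment[h]))
        stolen.append(assignment[donor].pop())
    assignment[new_host] = stolen
    return assignment

Q = 12
hosts = ["h1", "h2", "h3"]
start = {h: [p for p in range(Q) if p % len(hosts) == i] for i, h in enumerate(hosts)}
grown = join(start, "h4")
print({h: len(p) for h, p in grown.items()})  # {'h1': 3, 'h2': 3, 'h3': 3, 'h4': 3}
```

The number of partitions stays at Q throughout; only their assignment to hosts changes.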

                                    Random Token Assignment

                   • Each host assigned T random tokens
                   • T random tokens generated for joining
                            hosts; New tokens divide existing ranges
                   • Similar to libketama; Identical to Classic
                            Cassandra when T=1
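
Random token assignment can be sketched as follows. The 64-bit token space, the fixed seed, and the helper names are assumptions for illustration; real token generation in Cassandra is handled by the partitioner.

```python
import random
from bisect import bisect_left

RING_SIZE = 2 ** 64  # assumed token space for this sketch

def build_ring(hosts, num_tokens, seed=0):
    """Give each host num_tokens random tokens; return sorted (token, host) pairs."""
    rng = random.Random(seed)
    return sorted((rng.getrandbits(64), h) for h in hosts for _ in range(num_tokens))

def owner(ring, token):
    """Host holding the first token at or after `token`, wrapping past the end."""
    i = bisect_left(ring, (token,))
    return ring[i % len(ring)][1]

def shares(ring):
    """Fraction of the token space owned by each host (needs >= 2 distinct tokens)."""
    out = {}
    for i, (tok, host) in enumerate(ring):
        prev = ring[i - 1][0]  # i == 0 wraps around to the last token
        out[host] = out.get(host, 0.0) + ((tok - prev) % RING_SIZE) / RING_SIZE
    return out

hosts = ["h1", "h2", "h3"]
for t in (1, 256):
    ring = build_ring(hosts, t)
    print(t, {h: round(s, 3) for h, s in sorted(shares(ring).items())})
```

Running it with T=1 and T=256 lets you compare how evenly ownership is spread in each case.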


                   1. Number of partitions
                   2. Partition size
                   3. How 1 changes with more nodes and data
                   4. How 2 changes with more nodes and data

                            Strategy        No. Partitions   Partition size

                            Random                 O(N)         O(B/N)

                             Fixed                 O(1)          O(B)

                    Auto-sharding                  O(B)          O(1)

               B ~ total data size, N ~ number of hosts

                   • Automatic sharding
                     • partition size constant (great)
                     • number of partitions scales linearly with
                            data size (bad)
                   • Fixed partition assignment
                   • Random token assignment

                   •        Automatic sharding
                   •        Fixed partition assignment
                        •     Number of partitions is constant (good)
                        •     Partition size scales linearly with data size
                        •     Higher operational complexity (bad)
                   •        Random token assignment

                   • Automatic sharding
                   • Fixed partition assignment
                   • Random token assignment
                     • Number of partitions scales linearly with
                            number of hosts (ok)
                     • Partition size increases with more data;
                            decreases with more hosts (good)


                   • Automatic sharding
                   • Fixed partition assignment
                   • Random token assignment


               # Comma separated list of tokens,
               # (new installs only).


               # Number of tokens to generate.
               num_tokens: 256


Two parameters control how tokens are assigned. The initial_token parameter now optionally
accepts a comma-separated list, or (preferably) you can assign a numeric value to num_tokens
                                     nodetool info

      Token           :     (invoke with -T/--tokens to see all 256 tokens)
      ID              :     64090651-6034-41d5-bfc6-ddd24957f164
      Gossip active   :     true
      Thrift active   :     true
      Load            :     92.69 KB
      Generation No   :     1351030018
      Uptime (seconds):     45
      Heap Memory (MB):     95.16 / 1956.00
      Data Center     :     datacenter1
      Rack            :     rack1
      Exceptions      :     0
      Key Cache       :     size 240 (bytes), capacity 101711872 (bytes ...
      Row Cache       :     size 0 (bytes), capacity 0 (bytes), 0 hits, ...


To keep the output readable, nodetool info no longer displays tokens (if there is more than
one), unless the -T/--tokens argument is passed
                                              nodetool ring
      Datacenter: datacenter1
      Replicas: 2

      Address               Rack    Status State    Load         Owns     Token
                                                                          9022770486425350384
                            rack1   Up     Normal   97.24 KB     66.03%   -9182469192098976078
                            rack1   Up     Normal   97.24 KB     66.03%   -9054823614314102214
                            rack1   Up     Normal   97.24 KB     66.03%   -8970752544645156769
                            rack1   Up     Normal   97.24 KB     66.03%   -8927190060345427739
                            rack1   Up     Normal   97.24 KB     66.03%   -8880475677109843259
                            rack1   Up     Normal   97.24 KB     66.03%   -8817876497520861779
                            rack1   Up     Normal   97.24 KB     66.03%   -8810512134942064901
                            rack1   Up     Normal   97.24 KB     66.03%   -8661764562509480261
                            rack1   Up     Normal   97.24 KB     66.03%   -8641550925069186492
                            rack1   Up     Normal   97.24 KB     66.03%   -8636224350654790732


nodetool ring is still there, but the output is significantly more verbose, and it is less useful
as the go-to command
                                  nodetool status

      Datacenter: datacenter1
      |/ State=Normal/Leaving/Joining/Moving
      -- Address   Load    Tokens Owns   Host ID                               Rack
      UN 97.2 KB 256     66.0% 64090651-6034-41d5-bfc6-ddd24957f164   rack1
      UN 92.7 KB 256     66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c   rack1
      UN 92.6 KB 256     67.7% e4eef159-cb77-4627-84c4-14efbc868082   rack1


New go-to command is nodetool status
                                     nodetool status

      Datacenter: datacenter1
      |/ State=Normal/Leaving/Joining/Moving
      -- Address   Load    Tokens Owns   Host ID                                       Rack
      UN 97.2 KB 256     66.0% 64090651-6034-41d5-bfc6-ddd24957f164           rack1
      UN 92.7 KB 256     66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c           rack1
      UN 92.6 KB 256     67.7% e4eef159-cb77-4627-84c4-14efbc868082           rack1


Of note, since it is no longer practical to name a host by its token (because it can have
many), each host has a unique ID
                                  nodetool status

      Datacenter: datacenter1
      |/ State=Normal/Leaving/Joining/Moving
      -- Address   Load    Tokens Owns   Host ID                               Rack
      UN 97.2 KB 256     66.0% 64090651-6034-41d5-bfc6-ddd24957f164   rack1
      UN 92.7 KB 256     66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c   rack1
      UN 92.6 KB 256     67.7% e4eef159-cb77-4627-84c4-14efbc868082   rack1


Note the per-node token count

                            C               B

                            edit conf/cassandra.yaml and restart

               # Number of tokens to generate.
               num_tokens: 256


Step 1: Set num_tokens in cassandra.yaml, and restart node
                       convert to T contiguous tokens in existing ranges

                                        A AA
                                   A AA                   B



                                                                 AA A

                                                                 AAA AA
                                 A A





This will cause the existing range to be split into T contiguous tokens. This results in no
change to placement
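
The split can be pictured with a short sketch. It assumes a simple non-wrapping integer range (start, end] and evenly spaced sub-ranges; how Cassandra actually picks the split points is not shown on the slide, so treat the spacing as an assumption.

```python
def split_range(start, end, t):
    """Split the range (start, end] into t contiguous sub-ranges; return their end tokens."""
    width = end - start
    return [start + width * k // t for k in range(1, t + 1)]

# A node that owned (0, 1000] with one token now holds four contiguous tokens.
print(split_range(0, 1000, 4))  # [250, 500, 750, 1000]
```

The last token equals the node's original token and all sub-ranges stay on the same node, which is why placement does not change at this step.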

                                     A AA
                                A AA                    B



                                                               AA A

                                                               AAA AA
                             A A





Step 2: Initialize a shuffle operation. Nodes randomly exchange ranges.

                   • Range transfers are queued on each host
                   • Hosts initiate transfer of ranges to self
                   • Pay attention to the logs!

      Usage: shuffle [options] <sub-command>

       create               Initialize a new shuffle operation
       ls                   List pending relocations
       clear                Clear pending relocations
       en[able]             Enable shuffling
       dis[able]            Disable shuffling

       -dc, --only-dc               Apply only to named DC (create only)
       -tp, --thrift-port           Thrift port number (Default: 9160)
       -p,   --port                 JMX port number (Default: 7199)
       -tf, --thrift-framed         Enable framed transport for Thrift (Default: false)
       -en, --and-enable            Immediately enable shuffling (create only)
       -H,   --help                 Print help information
       -h,   --host                 JMX hostname or IP address (Default: localhost)
       -th, --thrift-host           Thrift hostname or IP address (Default: JMX host)





                            Acunu Reflex / Cassandra 1.2   Cassandra 1.1


17 node cluster of EC2 m1.large instances, 460M rows
The End
         • DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan
             Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan
             Sivasubramanian, Peter Vosshall, and Werner Vogels. “Dynamo: Amazon’s
             Highly Available Key-value Store.” Web.

         • Low, Richard. “Improving Cassandra's uptime with virtual nodes.” Web.
         • Overton, Sam. “Virtual Nodes Strategies.” Web.
         • Overton, Sam. “Virtual Nodes: Performance Results.” Web.
         • Jones, Richard. “libketama - a consistent hashing algo for memcache
             clients.” Web.

Cassandra ExplainedCassandra Explained
Cassandra Explained
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
Outside The Box With Apache Cassnadra
Outside The Box With Apache CassnadraOutside The Box With Apache Cassnadra
Outside The Box With Apache Cassnadra
The Cassandra Distributed Database
The Cassandra Distributed DatabaseThe Cassandra Distributed Database
The Cassandra Distributed Database
An Introduction To Cassandra
An Introduction To CassandraAn Introduction To Cassandra
An Introduction To Cassandra
Cassandra In A Nutshell
Cassandra In A NutshellCassandra In A Nutshell
Cassandra In A Nutshell

Recently uploaded

TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx
Stephanie Beckett
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
Tatiana Al-Chueyr
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
Andrey Yasko
20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf
Sally Laouacheria
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
Vijayananda Mohire
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
Matthew Sinclair
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
Yevgen Sysoyev
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
Kief Morris
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy

Recently uploaded (20)

TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy

Virtual Nodes: Rethinking Topology in Cassandra

  • 1. Rethinking Topology in Cassandra ApacheCon Europe November 7, 2012 Eric Evans @jericevans Wednesday, November 7, 12 1
  • 3. DHT 101: partitioning. [Ring diagram: Z, A] The keyspace is a namespace encompassing all possible keys.
  • 4. DHT 101: partitioning. [Ring diagram: Z, A, Y, B, C] The namespace is divided into N partitions (where N is the number of nodes). Partitions are mapped to nodes and placed evenly throughout the namespace.
  • 5. DHT 101: partitioning. [Ring diagram: Z, A, Y, B, C; Key = Aaa] A record, stored by key, is positioned on the next node (working clockwise) from where it sorts in the namespace.
  • 6. DHT 101: replica placement. [Ring diagram: Z, A, Y, B, C; Key = Aaa] Additional copies (replicas) are stored on other nodes, commonly the next N-1 nodes (N here being the number of copies), but any deterministic placement will work.
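
The placement rule from slides 5 and 6 can be sketched in a few lines of Python. This is only an illustration, not Cassandra's code: the `RING` mapping, the use of each node's letter as its token, and the `replicas_for` helper are all assumptions made for the example.

```python
from bisect import bisect_left

# Hypothetical ring: node name -> token (its position in the namespace).
# Using the node's letter as its token is an assumption for illustration.
RING = {"A": "A", "B": "B", "C": "C", "Y": "Y", "Z": "Z"}


def replicas_for(key, ring=RING, copies=3):
    """Return the nodes holding `key`: the next node clockwise, then the
    following copies-1 nodes."""
    nodes = sorted(ring, key=ring.get)      # node names in ring order
    tokens = [ring[n] for n in nodes]
    # First node whose token sorts at or after the key; wrap past the end.
    start = bisect_left(tokens, key) % len(nodes)
    count = min(copies, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


print(replicas_for("Aaa"))  # ['B', 'C', 'Y']
print(replicas_for("Zzz"))  # ['A', 'B', 'C'] (wraps around the ring)
```

The key `Aaa` sorts between A and B, so B is the first node clockwise and the next two nodes hold the additional copies.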
  • 7. DHT 101: consistency. [Diagram: Consistency, Availability, Partition tolerance] With multiple copies comes a set of trade-offs commonly articulated using the CAP theorem: at any given point, we can only guarantee two of Consistency, Availability, and Partition tolerance.
  • 8. DHT 101: scenario, consistency level = ONE. [Diagram labels: A, W, ?, ?] Writing at consistency level ONE provides very high availability: only one of the 3 member nodes needs to be up for the write to succeed.
  • 9. DHT 101: scenario, consistency level = ALL. [Diagram labels: A, R, ?, ?] If strong consistency is required, reads at consistency level ALL can be paired with writes performed at ONE. The trade-off is availability: all 3 member nodes must be up, or the read fails.
  • 10. DHT 101: scenario, quorum write. [Diagram labels: A, W, B, ?; R + W > N] Using QUORUM consistency, we only require floor(N/2) + 1 nodes.
  • 11. DHT 101: scenario, quorum read. [Diagram labels: ?, B, R, C; R + W > N] Using QUORUM consistency, we only require floor(N/2) + 1 nodes.
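
The arithmetic behind slides 8 to 11 is small enough to check directly. The sketch below uses hypothetical helper names (`quorum`, `overlaps`); N is the number of replicas, R the number of replicas read, and W the number written.

```python
def quorum(n):
    """Replicas needed for QUORUM with n replicas: floor(n/2) + 1."""
    return n // 2 + 1


def overlaps(r, w, n):
    """True when every read set must intersect every write set (R + W > N)."""
    return r + w > n


n = 3
print(quorum(n))                         # 2
print(overlaps(quorum(n), quorum(n), n))  # True: QUORUM reads see QUORUM writes
print(overlaps(n, 1, n))                  # True: read ALL, write ONE
print(overlaps(1, 1, n))                  # False: read ONE, write ONE
```

With three replicas, QUORUM tolerates one node being down on both reads and writes, whereas the ALL/ONE pairing tolerates none on the read side.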
  • 14. Problem: Poor load distribution
  • 15. Distributing Load. [Ring diagram: Z, A, Y, B, C, M] B and C hold replicas of A.
  • 16. Distributing Load. [Ring diagram: Z, A, Y, B, C, M] A and B hold replicas of Z.
  • 17. Distributing Load. [Ring diagram: Z, A, Y, B, C, M] Z and A hold replicas of Y.
  • 18. Distributing Load. [Ring diagram: Z, A, Y, B, C, M] Disaster strikes!
  • 19. Distributing Load. [Ring diagram: Z, A, Y, B, C, M] Replica sets [Y,Z,A], [Z,A,B], and [A,B,C] all suffer the loss of A, which results in extra load on neighboring nodes.
  • 20. Distributing Load. [Ring diagram: Z, A, A1, Y, B, C, M] Solution: replace/repair the down node.
  • 21. Distributing Load. [Ring diagram: Z, A, A1, Y, B, C, M] Solution: replace/repair the down node.
  • 22. Distributing Load. [Ring diagram: Z, A, A1, Y, B, C, M] Neighboring nodes are needed to stream missing data to A, which results in even more load on neighboring nodes.
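
To make the point of slide 19 concrete, the following sketch enumerates the replica sets on a small ring and picks out those that include a failed node. The clockwise order in `RING_ORDER` and the helper names are assumptions for illustration; the replication factor of 3 matches the sets shown on the slide.

```python
# Assumed clockwise node order; the slides show only these six labels.
RING_ORDER = ["Y", "Z", "A", "B", "C", "M"]


def replica_sets(ring, rf=3):
    """One replica set per node: the node itself plus the next rf-1 clockwise."""
    n = len(ring)
    return [[ring[(i + j) % n] for j in range(rf)] for i in range(n)]


def affected_by(node, ring, rf=3):
    """Replica sets that lose a member when `node` goes down."""
    return [s for s in replica_sets(ring, rf) if node in s]


sets = affected_by("A", RING_ORDER)
print(sets)  # [['Y', 'Z', 'A'], ['Z', 'A', 'B'], ['A', 'B', 'C']]

# Only A's immediate neighbors pick up the slack, however large the cluster is.
print(sorted({n for s in sets for n in s if n != "A"}))  # ['B', 'C', 'Y', 'Z']
```

The second line of output is the crux of the problem: the extra load, and later the streaming needed to rebuild A, falls on the same handful of neighbors rather than on the cluster as a whole.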
  • 23. Problem: Poor data distribution
  • 24. Distributing Data. [Ring diagram: A, C, D, B] Ideal distribution of the keyspace.
  • 25. Distributing Data. [Ring diagram: A, E, C, D, B] Bootstrapping a node bisects one partition; the distribution is no longer ideal.
  • 26. Distributing Data. [Two ring diagrams: A, E, C, D, B] Moving existing nodes means moving the corresponding data; not ideal.
  • 27. Distributing Data. [Two ring diagrams: A, E, C, D, B] Moving existing nodes means moving the corresponding data; not ideal.
  • 28. Distributing Data. [Ring diagram: A, H, E, C, D, G, F, B] Frequently cited alternative: double the size of your cluster, bisecting all ranges.
  • 29. Distributing Data. [Ring diagram: A, H, E, C, D, G, F, B] Frequently cited alternative: double the size of your cluster, bisecting all ranges.
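
The effect of bootstrapping versus doubling can be worked through with exact fractions. In this sketch the ring is normalized to the interval [0, 1), and the token positions and the `ownership` helper are assumptions chosen for illustration.

```python
from fractions import Fraction as F


def ownership(tokens):
    """tokens: node -> position in [0, 1). Each node owns the range that
    ends at its own token and starts at the previous token on the ring."""
    ordered = sorted(tokens.items(), key=lambda kv: kv[1])
    owned = {}
    for i, (node, tok) in enumerate(ordered):
        prev = ordered[i - 1][1]            # index -1 wraps to the last node
        owned[node] = (tok - prev) % 1 or F(1)
    return owned


ideal = {"A": F(0), "B": F(1, 4), "C": F(1, 2), "D": F(3, 4)}
print(ownership(ideal))            # every node owns 1/4

# Bootstrapping E in the middle of B's range bisects only that partition.
with_e = dict(ideal, E=F(1, 8))
print(ownership(with_e))           # B and E own 1/8 each; A, C, D still own 1/4

# Doubling the cluster bisects every range and restores balance.
doubled = dict(with_e, F=F(3, 8), G=F(5, 8), H=F(7, 8))
print(ownership(doubled))          # every node owns 1/8
```

With one token per node, balance is only restored by moving existing nodes (and their data) or by adding as many nodes as the cluster already has.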
  • 31. In a nutshell... [Diagram: three hosts] Basically: “nodes” on the ring are virtual, and many of them are mapped to each “real” node (host).
  • 32. Benefits
      • Operationally simpler (no token management)
      • Better distribution of load
      • Concurrent streaming involving all hosts
      • Smaller partitions mean greater reliability
      • Supports heterogeneous hardware
  • 33. Strategies
      • Automatic sharding
      • Fixed partition assignment
      • Random token assignment
  • 34. Strategy: Automatic Sharding
      • Partitions are split when data exceeds a threshold
      • Newly created partitions are relocated to a host with lower data load
      • Similar to sharding performed by Bigtable, or Mongo auto-sharding
  • 35. Strategy: Fixed Partition Assignment
      • Namespace divided into Q evenly-sized partitions
      • Q/N partitions assigned per host (where N is the number of hosts)
      • Joining hosts “steal” partitions evenly from existing hosts
      • Used by Dynamo and Voldemort (described in the Dynamo paper as “strategy 3”)
  • 36. Strategy: Random Token Assignment
      • Each host is assigned T random tokens
      • T random tokens are generated for joining hosts; new tokens divide existing ranges
      • Similar to libketama; identical to Classic Cassandra when T=1
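
Random token assignment is easy to simulate. The sketch below is an illustration under stated assumptions, not Cassandra's implementation: the 2^64 token space, the host names, and the `assign_tokens` and `ownership` helpers are all made up for the example.

```python
import random
from collections import defaultdict

RING_SIZE = 2**64  # assumed size of the token space


def assign_tokens(hosts, t, rng):
    """Give each host t random tokens; returns a sorted list of (token, host)."""
    return sorted((rng.randrange(RING_SIZE), h) for h in hosts for _ in range(t))


def ownership(ring):
    """Fraction of the ring owned by each host, summed over all its tokens.
    A token owns the range that ends at it and starts at the previous token."""
    owned = defaultdict(int)
    for i, (tok, host) in enumerate(ring):
        prev = ring[i - 1][0]               # index -1 wraps around the ring
        owned[host] += (tok - prev) % RING_SIZE
    return {h: w / RING_SIZE for h, w in owned.items()}


rng = random.Random(42)
hosts = ["host1", "host2", "host3"]
for t in (1, 256):
    shares = ownership(assign_tokens(hosts, t, rng))
    print(t, {h: round(s, 3) for h, s in sorted(shares.items())})
```

Running it with T=1 and then T=256 gives a feel for how many small random ranges per host tend to even out ownership, which is the intuition behind the strategy. A joining host would simply add T more entries to the ring, each dividing an existing range.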
  • 37. Considerations
      1. Number of partitions
      2. Partition size
      3. How 1 changes with more nodes and data
      4. How 2 changes with more nodes and data
  • 38. Evaluating
        Strategy        No. of partitions   Partition size
        Random          O(N)                O(B/N)
        Fixed           O(1)                O(B)
        Auto-sharding   O(B)                O(1)
        B ~ total data size, N ~ number of hosts
  • 39. Evaluating
      • Automatic sharding
          • Partition size is constant (great)
          • Number of partitions scales linearly with data size (bad)
      • Fixed partition assignment
      • Random token assignment
  • 40. Evaluating
      • Automatic sharding
      • Fixed partition assignment
          • Number of partitions is constant (good)
          • Partition size scales linearly with data size (bad)
          • Higher operational complexity (bad)
      • Random token assignment
  • 41. Evaluating
      • Automatic sharding
      • Fixed partition assignment
      • Random token assignment
          • Number of partitions scales linearly with number of hosts (ok)
          • Partition size increases with more data; decreases with more hosts (good)
  • 42. Evaluating
      • Automatic sharding
      • Fixed partition assignment
      • Random token assignment
  • 44. Configuration: conf/cassandra.yaml
        # Comma separated list of tokens,
        # (new installs only).
        initial_token:<token>,<token>,<token>
    or
        # Number of tokens to generate.
        num_tokens: 256
    Two parameters control how tokens are assigned. The initial_token parameter now optionally accepts a comma-separated list, or (preferably) you can assign a numeric value to num_tokens.
  • 45. Configuration: nodetool info
        Token            : (invoke with -T/--tokens to see all 256 tokens)
        ID               : 64090651-6034-41d5-bfc6-ddd24957f164
        Gossip active    : true
        Thrift active    : true
        Load             : 92.69 KB
        Generation No    : 1351030018
        Uptime (seconds) : 45
        Heap Memory (MB) : 95.16 / 1956.00
        Data Center      : datacenter1
        Rack             : rack1
        Exceptions       : 0
        Key Cache        : size 240 (bytes), capacity 101711872 (bytes ...
        Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, ...
    To keep the output readable, nodetool info no longer displays tokens (if there is more than one) unless the -T/--tokens argument is passed.
  • 46. Configuration: nodetool ring
        Datacenter: datacenter1
        ==========
        Replicas: 2

        Address  Rack   Status  State   Load      Owns    Token
                                                          9022770486425350384
                 rack1  Up      Normal  97.24 KB  66.03%  -9182469192098976078
                 rack1  Up      Normal  97.24 KB  66.03%  -9054823614314102214
                 rack1  Up      Normal  97.24 KB  66.03%  -8970752544645156769
                 rack1  Up      Normal  97.24 KB  66.03%  -8927190060345427739
                 rack1  Up      Normal  97.24 KB  66.03%  -8880475677109843259
                 rack1  Up      Normal  97.24 KB  66.03%  -8817876497520861779
                 rack1  Up      Normal  97.24 KB  66.03%  -8810512134942064901
                 rack1  Up      Normal  97.24 KB  66.03%  -8661764562509480261
                 rack1  Up      Normal  97.24 KB  66.03%  -8641550925069186492
                 rack1  Up      Normal  97.24 KB  66.03%  -8636224350654790732
        ...
    nodetool ring is still there, but the output is significantly more verbose, and it is less useful as the go-to command.
  • 47. Configuration: nodetool status
        Datacenter: datacenter1
        =======================
        Status=Up/Down
        |/ State=Normal/Leaving/Joining/Moving
        --  Address  Load     Tokens  Owns   Host ID                               Rack
        UN           97.2 KB  256     66.0%  64090651-6034-41d5-bfc6-ddd24957f164  rack1
        UN           92.7 KB  256     66.2%  b3c3b03c-9202-4e7b-811a-9de89656ec4c  rack1
        UN           92.6 KB  256     67.7%  e4eef159-cb77-4627-84c4-14efbc868082  rack1
    The new go-to command is nodetool status.
  • 48. Configuration: nodetool status (same output as slide 47). Of note: since it is no longer practical to name a host by its token (because it can have many), each host has a unique ID.
  • 49. Configuration: nodetool status (same output as slide 47). Note the per-node token count.
  • 50. Migration. [Ring diagram: A, C, B]
  • 51. Migration: edit conf/cassandra.yaml and restart
        # Number of tokens to generate.
        num_tokens: 256
    Step 1: Set num_tokens in cassandra.yaml, and restart the node.
  • 52. Migration: convert to T contiguous tokens in existing ranges. [Ring diagram: the ranges of A, B, and C each divided into many contiguous tokens] This will cause the existing range to be split into T contiguous tokens. This results in no change to placement.
  • 53. Migration: shuffle. [Ring diagram: the ranges of A, B, and C each divided into many contiguous tokens] Step 2: Initialize a shuffle operation. Nodes randomly exchange ranges.
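
The first migration step can be pictured as cutting a host's single range into T adjacent pieces that it continues to own. The sketch below is a hypothetical illustration of that idea and not Cassandra's actual algorithm: the signed 64-bit token space is inferred from the token values in the nodetool ring output, and `split_range`, with its even spacing, is an assumption.

```python
MIN_TOKEN = -2**63   # assumed signed 64-bit token space
RING_SIZE = 2**64


def split_range(prev_token, token, t):
    """Return t contiguous tokens dividing the range (prev_token, token].
    The last token returned is always `token`, so the host still owns
    exactly the data it owned before the split."""
    # A host whose range wraps onto itself owns the whole ring.
    width = (token - prev_token) % RING_SIZE or RING_SIZE
    start = prev_token - MIN_TOKEN
    return [MIN_TOKEN + (start + width * (i + 1) // t) % RING_SIZE
            for i in range(t)]


print(split_range(0, 100, 4))  # [25, 50, 75, 100]
```

Because all T new tokens stay with the original host, no data moves at this stage; only the subsequent shuffle, in which hosts exchange these small ranges at random, changes placement.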
  • 54. Shuffle
      • Range transfers are queued on each host
      • Hosts initiate transfer of ranges to self
      • Pay attention to the logs!
  • 55. Shuffle: bin/shuffle
        Usage: shuffle [options] <sub-command>

        Sub-commands:
          create       Initialize a new shuffle operation
          ls           List pending relocations
          clear        Clear pending relocations
          en[able]     Enable shuffling
          dis[able]    Disable shuffling

        Options:
          -dc, --only-dc        Apply only to named DC (create only)
          -tp, --thrift-port    Thrift port number (Default: 9160)
          -p,  --port           JMX port number (Default: 7199)
          -tf, --thrift-framed  Enable framed transport for Thrift (Default: false)
          -en, --and-enable     Immediately enable shuffling (create only)
          -H,  --help           Print help information
          -h,  --host           JMX hostname or IP address (Default: localhost)
          -th, --thrift-host    Thrift hostname or IP address (Default: JMX host)
  • 57. removenode. [Bar chart: y-axis 0 to 400; bars for Acunu Reflex / Cassandra 1.2 and Cassandra 1.1] 17-node cluster of EC2 m1.large instances, 460M rows.
  • 58. bootstrap. [Bar chart: y-axis 0 to 500; bars for Acunu Reflex / Cassandra 1.2 and Cassandra 1.1] 17-node cluster of EC2 m1.large instances, 460M rows.
  • 59. The End
      • Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. “Dynamo: Amazon’s Highly Available Key-value Store.” Web.
      • Low, Richard. “Improving Cassandra's uptime with virtual nodes.” Web.
      • Overton, Sam. “Virtual Nodes Strategies.” Web.
      • Overton, Sam. “Virtual Nodes: Performance Results.” Web.
      • Jones, Richard. “libketama - a consistent hashing algo for memcache clients.” Web.