Rethinking Topology in Cassandra


                            ApacheCon Europe
                            November 7, 2012



                                Eric Evans
                            eevans@acunu.com
                               @jericevans


DHT 101



DHT 101
                                    partitioning
[Diagram: the key space drawn as a ring, wrapping from Z back around to A]

The keyspace, a namespace encompassing all possible keys
DHT 101
                                      partitioning



[Diagram: the ring divided among nodes A, B, C, …, Y, Z]

The namespace is divided into N partitions (where N is the number of nodes). Partitions are
mapped to nodes and placed evenly throughout the namespace.
DHT 101
                                      partitioning



[Diagram: ring of nodes A, B, C, …, Y, Z; Key = Aaa falls just past A and is stored on the next node clockwise]

A record, stored by key, is positioned on the next node (working clockwise) from where it
sorts in the namespace
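To make the clockwise rule concrete, here is a minimal sketch in Python (not Cassandra code; the ring positions are illustrative): the sorted node positions form the ring, and a key is stored on the first node at or past where it sorts, wrapping around past the end.

      import bisect

      ring = ["A", "B", "C", "Y", "Z"]   # node positions, sorted around the ring

      def primary_node(key: str) -> str:
          # first node at or past where the key sorts, wrapping past Z back to A
          i = bisect.bisect_left(ring, key) % len(ring)
          return ring[i]

      print(primary_node("Aaa"))  # sorts just after "A", so it lands on "B"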
DHT 101
                                    replica placement



[Diagram: same ring; Key = Aaa and the nodes holding its replicas]

Additional copies (replicas) are stored on other nodes. Commonly the next N-1 nodes, but
anything deterministic will work.
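The same toy ring extended with the replica rule from this slide, again as an illustrative Python sketch rather than Cassandra's implementation: copies go to the next nodes working clockwise from the primary.

      import bisect

      ring = ["A", "B", "C", "Y", "Z"]

      def replica_nodes(key: str, replication_factor: int = 3) -> list[str]:
          # primary plus the next replication_factor - 1 nodes clockwise
          i = bisect.bisect_left(ring, key)
          return [ring[(i + n) % len(ring)] for n in range(replication_factor)]

      print(replica_nodes("Aaa"))  # ['B', 'C', 'Y'] on this toy ring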
DHT 101
                                     consistency




                      Consistency
                      Availability
                      Partition tolerance



With multiple copies comes a set of trade-offs commonly articulated using the CAP theorem;
At any given point, we can only guarantee 2 of Consistency, Availability, and Partition
tolerance.
DHT 101
                            scenario: consistency level = one


[Diagram: write (W) at consistency ONE; replica A acknowledges, the other two replicas (?) may be unavailable]

Writing at consistency level ONE provides very high availability: only one of the 3 replica nodes needs to be up for the write to succeed
DHT 101
                            scenario: consistency level = all


[Diagram: read (R) at consistency ALL across replica A and the two other replicas (?)]

If strong consistency is required, reads at consistency ALL can be paired with writes performed at ONE. The trade-off is availability: all 3 replica nodes must be up, or the read fails.
DHT 101
                            scenario: quorum write


[Diagram: quorum write (W) acknowledged by A and B; the third replica (?) may be down; R + W > N]

Using QUORUM consistency, we only require floor((N/2)+1) nodes.
DHT 101
                             scenario: quorum read


[Diagram: quorum read (R) answered by B and C; the third replica (?) may be down; R + W > N]

Using QUORUM consistency, we only require floor((N/2)+1) nodes.
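The consistency-level arithmetic behind these scenarios, as a small sketch (ONE, QUORUM, and ALL are modeled only as acknowledgement counts; this is not Cassandra's API):

      def required_acks(level: str, n_replicas: int) -> int:
          return {"ONE": 1,
                  "QUORUM": n_replicas // 2 + 1,   # floor(N/2) + 1
                  "ALL": n_replicas}[level]

      N = 3
      R = required_acks("QUORUM", N)   # 2
      W = required_acks("QUORUM", N)   # 2
      # R + W > N means every read quorum overlaps every write quorum,
      # so at least one replica in the read set has seen the latest write.
      print(R, W, R + W > N)           # 2 2 True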
Awesome, yes?




Well...




Problem:
                            Poor load distribution




Distributing Load

[Diagram: ring of nodes A, B, C, …, M, …, Y, Z]

B and C hold replicas of A
Distributing Load

[Diagram: same ring]

A and B hold replicas of Z
Distributing Load

[Diagram: same ring]

Z and A hold replicas of Y
Distributing Load

[Diagram: same ring]

Disaster strikes!
Distributing Load

[Diagram: same ring, with node A down]

Sets [Y,Z,A], [Z,A,B], [A,B,C] all suffer the loss of A; Results in extra load on neighboring
nodes
Distributing Load

[Diagram: same ring, with replacement node A1 brought in at A's position]

Solution: Replace/repair down node
Distributing Load

[Diagram: same as previous slide]

Solution: Replace/repair down node
Distributing Load

[Diagram: same ring; neighbors stream data to the replacement node]

Neighboring nodes are needed to stream missing data to A; Results in even more load on
neighboring nodes
Problem:
                            Poor data distribution




Distributing Data
[Diagram: ring with four nodes A, B, C, D evenly spaced]

Ideal distribution of keyspace
Distributing Data
[Diagram: same ring with a fifth node E bootstrapped in, bisecting one partition]

Bootstrapping a node, bisecting one partition; Distribution is no longer ideal
Distributing Data
[Diagram: existing nodes moved to new positions to restore even spacing]

Moving existing nodes means moving corresponding data; Not ideal
Distributing Data
[Diagram: same as previous slide]

Moving existing nodes means moving corresponding data; Not ideal
Distributing Data
[Diagram: cluster doubled to eight nodes A through H, bisecting every range]

Frequently cited alternative: Double the size of your cluster, bisecting all ranges
Distributing Data
[Diagram: same as previous slide]

Frequently cited alternative: Double the size of your cluster, bisecting all ranges
Virtual Nodes



In a nutshell...

[Diagram: a ring of many virtual nodes, groups of which map to each of three physical hosts]

Basically: “nodes” on the ring are virtual, and many of them are mapped to each “real” node
(host)
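A toy illustration of that mapping (made-up host names and a small token count, not Cassandra internals; the deck's config example uses num_tokens: 256): each physical host owns many small, randomly scattered token ranges instead of a single contiguous arc.

      import random

      hosts = ["host1", "host2", "host3"]
      tokens_per_host = 8

      random.seed(42)
      ring = sorted((random.randrange(2**64), host)
                    for host in hosts
                    for _ in range(tokens_per_host))

      for token, host in ring[:6]:
          print(f"{token:>20}  ->  {host}")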
Benefits
                   • Operationally simpler (no token management)
                   • Better distribution of load
                   • Concurrent streaming involving all hosts
                   • Smaller partitions mean greater reliability
                   • Supports heterogeneous hardware


Strategies

                   • Automatic sharding
                   • Fixed partition assignment
                   • Random token assignment



Strategy
                                        Automatic Sharding



                   • Partitions are split when data exceeds a threshold
                   • Newly created partitions are relocated to a host with lower data load
                   • Similar to sharding performed by Bigtable, or Mongo auto-sharding
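A rough sketch of this split-and-relocate behaviour, with invented data structures (nothing here is Bigtable or Mongo code): when a partition's size crosses a threshold it is split at its midpoint and the new half is handed to the least-loaded host.

      THRESHOLD = 100

      def maybe_split(partition, partitions, host_load):
          if partition["size"] <= THRESHOLD:
              return
          mid = (partition["lo"] + partition["hi"]) // 2
          new = {"lo": mid, "hi": partition["hi"],
                 "size": partition["size"] // 2,
                 "host": min(host_load, key=host_load.get)}  # least-loaded host
          partition["hi"] = mid
          partition["size"] -= new["size"]
          host_load[partition["host"]] -= new["size"]   # half the data relocates
          host_load[new["host"]] += new["size"]
          partitions.append(new)

      partitions = [{"lo": 0, "hi": 1000, "size": 150, "host": "host1"}]
      host_load = {"host1": 150, "host2": 0}
      maybe_split(partitions[0], partitions, host_load)
      print(partitions, host_load)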



Strategy
                                     Fixed Partition Assignment

                   • Namespace divided into Q evenly-sized partitions
                   • Q/N partitions assigned per host (where N is the number of hosts)
                   • Joining hosts “steal” partitions evenly from existing hosts.
                   • Used by Dynamo and Voldemort (described in Dynamo paper as “strategy 3”)
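A sketch of the Q/N bookkeeping under this strategy (hypothetical helper, not Dynamo or Voldemort code): with Q fixed partitions, a joining host steals partitions from hosts holding more than the new Q/N target.

      Q = 12  # fixed number of partitions, chosen up front

      def rebalance(assignment: dict[int, str], new_host: str) -> None:
          hosts = sorted(set(assignment.values())) + [new_host]
          target = Q // len(hosts)                      # Q/N partitions per host
          counts = {h: list(assignment.values()).count(h) for h in hosts[:-1]}
          for p in sorted(assignment):
              owner = assignment[p]
              if counts.get(owner, 0) > target:
                  counts[owner] -= 1
                  assignment[p] = new_host              # "steal" this partition

      assignment = {p: f"host{p % 3 + 1}" for p in range(Q)}  # 3 hosts, 4 partitions each
      rebalance(assignment, "host4")
      print(sorted(assignment.items()))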


Strategy
                                    Random Token Assignment



                   • Each host assigned T random tokens
                   • T random tokens generated for joining hosts; New tokens divide existing ranges
                   • Similar to libketama; Identical to Classic Cassandra when T=1
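And a sketch of what joining looks like under random token assignment (toy token space, illustrative only): the joining host draws T random tokens, each of which bisects whatever existing range it lands in; with T=1 this reduces to the classic single-token behaviour.

      import bisect, random

      random.seed(7)
      ring = sorted(random.sample(range(1000), 12))     # existing tokens on the ring
      T = 4
      new_tokens = sorted(random.sample(range(1000), T))

      for t in new_tokens:
          i = bisect.bisect_left(ring, t)
          lo, hi = ring[i - 1], ring[i % len(ring)]     # the range the new token divides
          print(f"token {t} splits range ({lo}, {hi}]")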



Considerations

                   1. Number of partitions
                   2. Partition size
                   3. How 1 changes with more nodes and data
                   4. How 2 changes with more nodes and data




Evaluating
                   Strategy         No. Partitions   Partition size
                   Random           O(N)             O(B/N)
                   Fixed            O(1)             O(B)
                   Auto-sharding    O(B)             O(1)

                   B ~ total data size, N ~ number of hosts


Evaluating
                   • Automatic sharding
                     • partition size constant (great)
                      • number of partitions scales linearly with data size (bad)
                   • Fixed partition assignment
                   • Random token assignment

Evaluating
                   •        Automatic sharding
                   •        Fixed partition assignment
                        •     Number of partitions is constant (good)
                        •     Partition size scales linearly with data size (bad)
                        •     Higher operational complexity (bad)
                   •        Random token assignment


Evaluating
                   • Automatic sharding
                   • Fixed partition assignment
                   • Random token assignment
                      • Number of partitions scales linearly with number of hosts (ok)
                      • Partition size increases with more data; decreases with more hosts (good)


Evaluating


                   • Automatic sharding
                   • Fixed partition assignment
                   • Random token assignment



Cassandra



Configuration
                               conf/cassandra.yaml


               # Comma separated list of tokens,
               # (new installs only).
               initial_token: <token>,<token>,<token>

               or

               # Number of tokens to generate.
               num_tokens: 256




Two params control how tokens are assigned. The initial_token param now optionally
accepts a csv list, or (preferably) you can assign a numeric value to num_tokens
Configuration
                                     nodetool info

      Token           :     (invoke with -T/--tokens to see all 256 tokens)
      ID              :     64090651-6034-41d5-bfc6-ddd24957f164
      Gossip active   :     true
      Thrift active   :     true
      Load            :     92.69 KB
      Generation No   :     1351030018
      Uptime (seconds):     45
      Heap Memory (MB):     95.16 / 1956.00
      Data Center     :     datacenter1
      Rack            :     rack1
      Exceptions      :     0
      Key Cache       :     size 240 (bytes), capacity 101711872 (bytes ...
      Row Cache       :     size 0 (bytes), capacity 0 (bytes), 0 hits, ...





To keep the output readable, nodetool info no longer displays tokens (if there is more than one) unless the -T/--tokens argument is passed
Configuration
                                              nodetool ring
      Datacenter: datacenter1
      ==========
      Replicas: 2

      Address               Rack    Status State    Load         Owns     Token
                                                                          9022770486425350384
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -9182469192098976078
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -9054823614314102214
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8970752544645156769
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8927190060345427739
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8880475677109843259
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8817876497520861779
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8810512134942064901
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8661764562509480261
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8641550925069186492
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8636224350654790732
      ...
      ...





nodetool ring is still there, but the output is significantly more verbose, and it is less useful as the go-to command
Configuration
                                  nodetool status




      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      -- Address   Load    Tokens Owns   Host ID                               Rack
      UN 10.0.0.1 97.2 KB 256     66.0% 64090651-6034-41d5-bfc6-ddd24957f164   rack1
      UN 10.0.0.2 92.7 KB 256     66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c   rack1
      UN 10.0.0.3 92.6 KB 256     67.7% e4eef159-cb77-4627-84c4-14efbc868082   rack1





New go-to command is nodetool status
Configuration
                                     nodetool status




      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      -- Address   Load    Tokens Owns   Host ID                                       Rack
      UN 10.0.0.1 97.2 KB 256     66.0% 64090651-6034-41d5-bfc6-ddd24957f164           rack1
      UN 10.0.0.2 92.7 KB 256     66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c           rack1
      UN 10.0.0.3 92.6 KB 256     67.7% e4eef159-cb77-4627-84c4-14efbc868082           rack1





Of note, since it is no longer practical to name a host by its token (because it can have many), each host has a unique ID
Configuration
                                  nodetool status




      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      -- Address   Load    Tokens Owns   Host ID                               Rack
      UN 10.0.0.1 97.2 KB 256     66.0% 64090651-6034-41d5-bfc6-ddd24957f164   rack1
      UN 10.0.0.2 92.7 KB 256     66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c   rack1
      UN 10.0.0.3 92.6 KB 256     67.7% e4eef159-cb77-4627-84c4-14efbc868082   rack1





Note the per-node token count
Migration
[Diagram: a three-node cluster, A, B, and C, before migration]
Migration
                            edit conf/cassandra.yaml and restart




               # Number of tokens to generate.
               num_tokens: 256





Step 1: Set num_tokens in cassandra.yaml, and restart node
Migration
                       convert to T contiguous tokens in existing ranges

[Diagram: each node's single range now split into many contiguous tokens, still owned by the same node]

This will cause the existing range to be split into T contiguous tokens. This results in no
change to placement
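A sketch of that conversion with toy numbers (Cassandra works on its real token space; this just shows the arithmetic): the single range a node already owns is cut into T contiguous, evenly sized sub-ranges, so every key stays on the node it was on before.

      def split_range(prev_token: int, my_token: int, T: int) -> list[int]:
          # T - 1 evenly spaced interior tokens, plus the node's original token
          width = (my_token - prev_token) // T
          return [prev_token + width * i for i in range(1, T)] + [my_token]

      # A node that owned (0, 1200] with a single token now owns the same
      # span as T = 4 contiguous tokens: 300, 600, 900, 1200.
      print(split_range(0, 1200, 4))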
Migration
                                         shuffle

[Diagram: the same tokens after shuffling, randomly redistributed among nodes A, B, and C]

Step 2: Initialize a shuffle operation. Nodes randomly exchange ranges.
Shuffle

                   • Range transfers are queued on each host
                   • Hosts initiate transfer of ranges to self
                   • Pay attention to the logs!


Shuffle
                                          bin/shuffle
      Usage: shuffle [options] <sub-command>

      Sub-commands:
       create               Initialize a new shuffle operation
       ls                   List pending relocations
       clear                Clear pending relocations
       en[able]             Enable shuffling
       dis[able]            Disable shuffling

      Options:
       -dc, --only-dc               Apply only to named DC (create only)
       -tp, --thrift-port           Thrift port number (Default: 9160)
       -p,   --port                 JMX port number (Default: 7199)
       -tf, --thrift-framed         Enable framed transport for Thrift (Default: false)
       -en, --and-enable            Immediately enable shuffling (create only)
       -H,   --help                 Print help information
       -h,   --host                 JMX hostname or IP address (Default: localhost)
       -th, --thrift-host           Thrift hostname or IP address (Default: JMX host)




Performance



removenode
[Bar chart: removenode, Acunu Reflex / Cassandra 1.2 vs Cassandra 1.1 (scale 0 to 400)]

17 node cluster of EC2 m1.large instances, 460M rows
bootstrap
[Bar chart: bootstrap, Acunu Reflex / Cassandra 1.2 vs Cassandra 1.1 (scale 0 to 500)]

17 node cluster of EC2 m1.large instances, 460M rows
The End
         • Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan
             Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan
             Sivasubramanian, Peter Vosshall and Werner Vogels “Dynamo: Amazon’s
             Highly Available Key-value Store” Web.

         • Low, Richard. “Improving Cassandra's uptime with virtual nodes” Web.
         • Overton, Sam. “Virtual Nodes Strategies.” Web.
         • Overton, Sam. “Virtual Nodes: Performance Results.” Web.
         • Jones, Richard. "libketama - a consistent hashing algo for memcache
             clients” Web.


