
When trying to replace a dead node, the replacement node fails after a long time with an OOM error. I believe the issue is linked to secondary indexes on a table. I can see that in the new node's /var/lib/cassandra/<keyspace>/<table>/.<secondary_index> directory there are hundreds of thousands of files. To make matters worse, there are a couple of secondary indexes. I've tried increasing node memory/heap, but it doesn't seem to matter; the node will not come online. At this point the failed node has been offline for longer than gc_grace_seconds.
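
For anyone wanting to confirm the same symptom, this is roughly how I counted the files (same placeholder path as above; substitute your real keyspace/table/index names):

    # Number of on-disk files belonging to one secondary index
    find /var/lib/cassandra/<keyspace>/<table>/.<secondary_index> -type f | wc -l

    # ...and how much disk space they occupy
    du -sh /var/lib/cassandra/<keyspace>/<table>/.<secondary_index>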

If I run nodetool removenode, will this same issue get spread to other nodes? If I add a new node, run nodetool assassinate, then repair, will it spread this issue to other nodes?

Thoughts on how best to get a replacement node into the cluster and operational?

I'm sure there are design issues with the DB structure - that is out of my control. I'm simply trying to help a team get their cluster back fully operational.

Update with more info: The old node failed during a repair, and would not come online due to a "too many open files" error. The configured open-files limit on the host is already at the Linux/OS maximum, per best practice.
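
For completeness, this is how we can check that the limit is actually applied to the running Cassandra process and how many descriptors it is holding (standard Linux commands; the PID is a placeholder):

    # Substitute the Cassandra PID (e.g. from `pgrep -f CassandraDaemon`)
    PID=<cassandra_pid>

    # Limits applied to the running process (can differ from `ulimit -n` in a shell)
    grep 'open files' /proc/$PID/limits

    # File descriptors the process currently holds
    ls /proc/$PID/fd | wc -l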

At that point, we attempted to start a replacement node (empty data dirs) with the JVM replace_address_first_boot setting. This did work and a new node came online, but there was a repair job on a large table with secondary indexes that never progressed (see above regarding the hundreds of thousands of files present). The repair progress stayed at 0 and eventually the JVM crashed with an OOM error.

We then truncated this table as it wasn't needed for this particular keyspace. At this point the "new" node is in the cluster but down. We then tried to replace this replacement node with the replace_address_first_boot setting, using the original IP (wiped data dirs), and it's failing to come online. It will start up and stream data from the other nodes for ~20 hours, then slow to a crawl as it writes hundreds of thousands of files for the same table in another keyspace (one we cannot drop/truncate), before eventually crashing with an OOM:

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 4088 bytes for AllocateHeap
# Possible reasons:
#   The system is out of physical RAM or swap space
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (allocation.inline.hpp:61), pid=31868, tid=0x00007f2297272700

The nodes have 80 GB of memory with a 40 GB heap, using the defaults for G1GC. We've tried increasing memory up to 128 GB (leaving the heap alone), with no change.
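
For reference, the error above is a native malloc failure rather than a Java heap OOM, so process RSS and thread count seem more relevant than heap size. A rough sketch of how to check them while the node is streaming (placeholder PID; the NMT summary is only available if the JVM was started with -XX:NativeMemoryTracking=summary):

    PID=<cassandra_pid>

    # Resident set size (KB) and thread count of the Cassandra process
    ps -o pid,rss,nlwp,cmd -p $PID

    # Native memory breakdown; requires -XX:NativeMemoryTracking=summary at startup
    jcmd $PID VM.native_memory summary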

It feels like it may be best to assassinate this node and simply add a new node. From my understanding, this will require running repairs across all keyspaces? As the original issue started with a repair (after deleting a bunch of data), the concern is that we don't want to introduce further issues into the database.
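
If we do go the assassinate-and-add route, my understanding is the follow-up repairs would look something like this, run per keyspace on each node in turn (just a sketch, happy to be corrected on the exact options):

    # -pr repairs only the primary token ranges owned by the node it runs on,
    # so running it on every node covers the cluster without duplicate work
    nodetool repair -pr <keyspace_name>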

2 Answers


Let's break down the differences among replacing a dead node, removing a node, and assassinating it.

Key Concepts:

  • Replacing a Dead Node:

    This process involves bringing up a new node to take over the token ranges of a dead node. The replacement node should have the same IP address as the dead node (or the appropriate configuration adjustments if using a different IP). The new node streams data from other nodes to ensure it has the correct data for its token ranges.

  • Removing a Node (nodetool removenode):

    This command removes a dead node's token ranges from the cluster. The remaining nodes in the cluster redistribute the data previously owned by the removed node. It is generally safe and won't spread issues unless the data itself is problematic (example invocations for removenode and assassinate are sketched just after this list).

  • Assassinating a Node (nodetool assassinate):

    This command forcefully removes a node from the cluster's metadata. Typically used when a node has been dead for a long time and nodetool removenode doesn't work. Assassinating a node doesn't involve data redistribution.
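
For reference, the invocations look roughly like this (the host ID comes from the ID column of nodetool status for the dead node; the IP is the dead node's address):

    # Remove a dead node by its host ID; `force` skips waiting for pending streams
    nodetool removenode <host_id>
    nodetool removenode force

    # Last resort: wipe the node from gossip metadata without any streaming
    nodetool assassinate <ip_of_dead_node>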

You didn't write much about your issue, but this is how I would replace the node:

  1. Stop the Old Node:

    If the old node is still running, ensure it is completely stopped.

  2. Delete All Data:

    On the node you plan to replace, delete all the existing data to ensure a clean state. Run:

    rm -fr /data/cassandra/commitlog/*
    rm -fr /data/cassandra/hints/*
    rm -fr /data/cassandra/data/*
    rm -fr /data/cassandra/saved_caches/*
    

    (adjust the paths if your data directory is different; I have a /data disk in this case).

  3. Update the Cassandra Settings:

    Set the replace_address_first_boot parameter to the IP address of the dead node. For example, add the following line at the end of your cassandra-env.sh file:

    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<address_of_dead_node>"
    
  4. Ensure auto_bootstrap is set to true in cassandra.yaml (this is the default value).

  5. Start Cassandra on the Replacement Node.

The replacement node will now bootstrap and stream data from its peers to take over the token ranges of the dead node.
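
While it bootstraps, it is worth keeping an eye on streaming progress, and if the bootstrap fails partway, recent Cassandra versions can often resume it instead of starting over (rough sketch, not specific to your setup):

    # Streaming sessions in/out of the joining node
    nodetool netstats

    # Pending compactions / secondary index builds
    nodetool compactionstats

    # Resume a failed bootstrap instead of wiping and re-streaming (Cassandra 2.2+)
    nodetool bootstrap resume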

If the replace fails, I would remove/assassinate the node, fix the issue with the index (if it is possible) and add a new node.
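
If dropping the problematic index turns out to be acceptable (see the comments below on how troublesome secondary indexes can be), it is a single CQL statement; the names here are placeholders:

    # Find the index name with: DESCRIBE TABLE <keyspace>.<table>; in cqlsh
    cqlsh <contact_point_ip> -e "DROP INDEX IF EXISTS <keyspace>.<index_name>;"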

  • I added some additional info above. I suspect that if I assassinate the node and add a new one, that would get me back to 6 nodes in the cluster. However, I suspect initiating repairs on the keyspace with the secondary indexes will lead to the same issue I'm having with bootstrapping a replacement. Thoughts?
    – John Koyle
    Commented Jun 19 at 14:02
  • Secondary indexes in Cassandra can create a large number of small SSTables, which exacerbates the problem during repairs and replacements. How many of them do you still have in your cluster? We usually avoid using them. Can you drop them, or should we try to fix the cluster with the secondary indexes in place?
    – Mario
    Commented Jun 28 at 17:19

It's not clear from your post how you are replacing the dead node. For contributors to be able to help you in a meaningful way, it would be great if you provide clear details of what you've attempted, what worked vs what didn't, and any relevant full error message + full stack trace.

In particular, you need to provide details of whether you're replacing the node with the "replace address" method (replace_address_first_boot JVM flag) or mounted the data/ volume onto a new server.

Either way, you should NOT remove the node with removenode, because then you will not be able to replace it with either the "replace" flag or by mounting the data on another node; you will have effectively kicked the node out of the cluster, so it will no longer be replaceable. By the same token, assassinating the node will also make it irreplaceable.

When you decommission (remove) a node from a cluster, the token(s) it used to own will be taken over by neighbouring nodes and data will get shuffled around the cluster. This is an expensive and unnecessary operation particularly if your ultimate goal is to replace the node with a new server.

Once the dead node is decommissioned, bootstrapping another server will simply add a new node to the cluster with new token(s) ownership -- not the same token(s) owned by the dead node. As such, it will involve a lot of unnecessary data shuffling between nodes in the cluster. Cheers!
