I'll jump in with a biased answer where I actually prefer separate chaining with singly-linked lists and find it easier to achieve performance with them (I'm not saying they're optimal, just easier for my use cases), as contradictory as that sounds.
Of course the theoretical optimum is still a hash table with no collisions at all, or a probing technique with minimal clustering. However, the separate chaining solution never has to deal with clustering in the first place.
That said, the data representation I use does not require a separate memory allocation per node. Here it is in C:
struct Bucket
{
    int head;
};
struct BucketNode
{
    int next;
    int element;
};
struct HashTable
{
    // Array of buckets, pre-allocated in advance.
    struct Bucket* buckets;
    // Array of nodes, pre-allocated assuming the client knows
    // how many nodes they're going to insert in advance. Otherwise
    // realloc using a growth strategy similar to std::vector in C++.
    struct BucketNode* nodes;
    // Number of bucket heads.
    int num_buckets;
    // Number of nodes inserted so far.
    int num_nodes;
};
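To make the mechanics concrete, here's a minimal sketch of initialization and insertion against that representation. The helper names (ht_init, ht_insert), the convention that -1 marks an empty bucket or end of chain, and the caller-supplied hash are my own assumptions for illustration, not part of the structure above:

```c
#include <stdlib.h>

struct Bucket { int head; };
struct BucketNode { int next; int element; };
struct HashTable
{
    struct Bucket* buckets;
    struct BucketNode* nodes;
    int num_buckets;
    int num_nodes;
};

/* Pre-allocates both arrays up front; -1 marks an empty bucket. */
static void ht_init(struct HashTable* ht, int num_buckets, int max_nodes)
{
    ht->buckets = malloc(num_buckets * sizeof *ht->buckets);
    ht->nodes = malloc(max_nodes * sizeof *ht->nodes);
    ht->num_buckets = num_buckets;
    ht->num_nodes = 0;
    for (int i = 0; i < num_buckets; ++i)
        ht->buckets[i].head = -1;
}

/* Inserts an element: grabs the next free node from the pool and
   pushes it onto the front of the bucket's chain. No per-node
   allocation -- just an index bump into the pre-allocated pool. */
static void ht_insert(struct HashTable* ht, int element, unsigned hash)
{
    int bucket = (int)(hash % (unsigned)ht->num_buckets);
    int node = ht->num_nodes++;
    ht->nodes[node].element = element;
    ht->nodes[node].next = ht->buckets[bucket].head;
    ht->buckets[bucket].head = node;
}
```

Insertion is O(1) and touches only the bucket head and the newest pool slot; colliding elements simply chain through indices into the same pool.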
The buckets are just 32-bit indices (in practice I don't even use a struct) and the nodes are just two 32-bit indices. Often I don't even need the element index, because the nodes are stored in parallel with the array of elements to be inserted into the table, reducing the overhead of the hash table to 32 bits per bucket and 32 bits per element inserted. The real version I use more often looks like this:
struct HashTable
{
    // Array of head indices. The indices point to entries in the
    // second array below.
    int* buckets;
    // Array of next indices parallel to the elements to insert.
    int* next_indices;
    // Number of bucket heads.
    int num_buckets;
};
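A sketch of how that parallel version can be built and searched, assuming the elements live in the caller's own array and the table just threads index chains through next_indices. The helper names (ht_build, ht_find) and the identity-style hash (element % num_buckets) are illustrative assumptions:

```c
#include <stdlib.h>

struct HashTable
{
    int* buckets;       /* head index per bucket, -1 if empty */
    int* next_indices;  /* next index per element, parallel to elements */
    int num_buckets;
};

/* Builds the table over the caller's element array. The elements are
   never copied; the table only stores index chains into them. */
static void ht_build(struct HashTable* ht, const int* elements,
                     int num_elements, int num_buckets)
{
    ht->buckets = malloc(num_buckets * sizeof(int));
    ht->next_indices = malloc(num_elements * sizeof(int));
    ht->num_buckets = num_buckets;
    for (int i = 0; i < num_buckets; ++i)
        ht->buckets[i] = -1;
    for (int i = 0; i < num_elements; ++i)
    {
        /* Illustrative hash: the element value itself, mod bucket count. */
        int bucket = (int)((unsigned)elements[i] % (unsigned)num_buckets);
        ht->next_indices[i] = ht->buckets[bucket];
        ht->buckets[bucket] = i;
    }
}

/* Returns the index of the first matching element, or -1 if absent. */
static int ht_find(const struct HashTable* ht, const int* elements, int key)
{
    int bucket = (int)((unsigned)key % (unsigned)ht->num_buckets);
    for (int i = ht->buckets[bucket]; i != -1; i = ht->next_indices[i])
        if (elements[i] == key)
            return i;
    return -1;
}
```

Note that lookups return an index into the caller's array rather than a copy of the element, which is what keeps the table's footprint at one 32-bit index per bucket plus one per element.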
Also, if spatial locality degrades, I can easily perform a post-processing pass that constructs a new hash table in which each bucket's nodes are contiguous with each other. The copy function is trivial: it does a linear pass through the old table, bucket by bucket, and because of the order in which it traverses, the new table ends up with all neighboring nodes in a bucket contiguous to each other.
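The heart of that repacking pass can be sketched as a single function that walks the chains bucket by bucket and emits a permutation; reordering the element array (and rebuilding the chains) by that permutation leaves every bucket's elements adjacent in memory. The function name and the raw-array interface are my own for illustration:

```c
/* buckets[num_buckets]: head index per bucket (-1 = empty).
   next[num_elements]: next index per element.
   Writes into order[] a permutation of element indices such that
   all elements of bucket 0 come first, then bucket 1, and so on.
   Reordering the element array by this permutation makes each
   bucket's chain contiguous, restoring spatial locality. */
static void pack_order(const int* buckets, const int* next,
                       int num_buckets, int* order)
{
    int out = 0;
    for (int b = 0; b < num_buckets; ++b)
        for (int i = buckets[b]; i != -1; i = next[i])
            order[out++] = i;
}
```

This is O(n) over buckets plus elements, so it's cheap enough to run occasionally as a defragmentation step rather than on every insertion.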
As for probing techniques, they come with the benefit that spatial locality is there from the beginning, without memory pools or a backing array like mine, and they also avoid the 32-bit overhead per bucket and per node. But then you may have to deal with clustering problems, which can start to accumulate in a vicious way with lots of collisions.
I find the very nature of clustering to be a headache that requires a lot of analysis in the event of many collisions. The benefit of this solution is that I can often achieve a decent result the first time around without such deep analysis and testing. Also, when a table resizes on its own implicitly, I've run into cases where such designs ended up blowing up memory usage in ways that far surpassed what this basic solution (32 bits per bucket and 32 bits per node) would run into even in the worst-case scenario. It's a solution that avoids becoming too bad even when there are a fair number of collisions.
Most of my codebase revolves around data structures which store indices, often in parallel with the array of elements to be inserted. That crunches down the memory size, avoids superfluous deep copies of the elements, and makes it very easy to reason about memory usage. Aside from that, in my case I tend to benefit as much from predictable performance as from optimal performance. An algorithm which is optimal in many common-case scenarios but can perform horribly in worst-case scenarios is often less preferable for me than one which performs reasonably well all the time and doesn't cause frame rates to stutter at unpredictable times, and so I tend to favor those kinds of solutions.