
I've been brushing up on algorithms and reviewed these two methods of implementing hash tables. It seems like they largely have similar performance characteristics and memory requirements.

I can think of some disadvantages to linear probing -- namely, that widening the array could be expensive (but this is done, what, 2 log N times at most? Probably not a big deal) and that managing deletions is a bit more difficult. But I'm assuming there are advantages, too, or it wouldn't be presented in textbooks as a viable implementation alongside the more obvious approach.

Why would you choose one over the other?

3 Answers


With linear probing (or any probing, really), a deletion has to be "soft". This means you need to put in a dummy value (often called a tombstone) that won't match anything the user could search for; otherwise you would need to rehash on every deletion. Rehashing once too many tombstones build up is still advisable, or some other strategy to defragment the graveyard.
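
For illustration, here is a minimal sketch of tombstone-based ("soft") deletion in a linear-probing table. It assumes non-negative int keys so that two values can be reserved as the "empty" and "tombstone" markers; the names and the toy hash are mine, not from the answer.

#include <stdbool.h>

#define EMPTY     (-1)   /* slot never used                     */
#define TOMBSTONE (-2)   /* slot deleted; probing must continue */

/* Fixed-size table of non-negative int keys using linear probing. */
struct ProbeTable
{
    int* slots;      /* each entry is a key, EMPTY or TOMBSTONE */
    int  capacity;
};

static int probe_find(const struct ProbeTable* t, int key)
{
    int home = key % t->capacity;                /* toy hash */
    for (int i = 0; i < t->capacity; i++)
    {
        int idx = (home + i) % t->capacity;
        if (t->slots[idx] == key)
            return idx;                          /* found */
        if (t->slots[idx] == EMPTY)
            return -1;                           /* a true hole: the key was never stored */
        /* TOMBSTONE or some other key: keep probing. */
    }
    return -1;
}

/* "Soft" delete: the slot becomes a tombstone rather than EMPTY, so
 * probe sequences that pass through it can still reach later keys. */
static bool probe_remove(struct ProbeTable* t, int key)
{
    int idx = probe_find(t, key);
    if (idx < 0)
        return false;
    t->slots[idx] = TOMBSTONE;
    return true;
}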

Separate chaining (where each bucket is a pointer to a linked list of values) has the disadvantage that you end up searching a linked list, with all the cache-related issues that entails.

One other advantage of the probing method is that the values all live in the same array. This makes copy-on-write very easy: you copy just the array. If you can be assured, by way of a class invariant, that the original is never modified, then taking a snapshot is O(1) and can be done without locking.
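
For illustration, a rough sketch of what that snapshot could look like, assuming the table's state is a single reference-counted slot array. The names and the refcount scheme are my own, error handling is omitted, and a concurrent version would need atomic reference counts.

#include <stdlib.h>
#include <string.h>

/* All table state lives in one contiguous, reference-counted block. */
struct FlatTable
{
    int* slots;
    int  capacity;
    int* refcount;   /* shared by every snapshot of this array */
};

/* O(1) snapshot: nothing is copied, only the refcount is bumped. */
struct FlatTable snapshot(struct FlatTable* t)
{
    (*t->refcount)++;
    return *t;
}

/* Before writing, copy the whole array if anyone else still shares it
 * (this is the single memcpy that a flat array makes cheap and easy). */
void make_unique(struct FlatTable* t)
{
    if (*t->refcount == 1)
        return;                                   /* sole owner: write in place */

    int* copy = malloc(t->capacity * sizeof *copy);
    memcpy(copy, t->slots, t->capacity * sizeof *copy);

    (*t->refcount)--;                             /* detach from the shared array */
    t->slots = copy;
    t->refcount = malloc(sizeof *t->refcount);
    *t->refcount = 1;
}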

  • You actually kind of reminded me of a related issue... if you're working with a non-nullable type, it's not obvious what the dummy value could be. For a double, for instance, do you just say, well, NaN is not a valid key and use that?
    – Casey
    Commented Apr 7, 2015 at 15:28
  • @emodendroket Or add a flag to the array type for "deleted", and likewise for "empty" (since you don't have null for that either), though with copy-on-write you can afford to do some extra work rehashing on deletion (see the sketch below).
    Commented Apr 7, 2015 at 15:30
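
Picking up the flag idea from the comments: a minimal sketch, assuming double keys, where a per-slot state byte replaces the reserved dummy value, so no key value has to be sacrificed as a sentinel. The names and the probe loop are illustrative only.

enum SlotState { SLOT_EMPTY, SLOT_OCCUPIED, SLOT_DELETED };

struct DoubleSlot
{
    double        key;
    unsigned char state;   /* one of SlotState */
};

/* Linear-probing lookup driven by the state byte rather than by
 * comparing against a dummy key value. 'home' is the hashed index. */
int find_slot(const struct DoubleSlot* slots, int capacity, int home, double key)
{
    for (int i = 0; i < capacity; i++)
    {
        int idx = (home + i) % capacity;
        if (slots[idx].state == SLOT_EMPTY)
            return -1;                            /* never inserted */
        if (slots[idx].state == SLOT_OCCUPIED && slots[idx].key == key)
            return idx;
        /* SLOT_DELETED: keep probing. */
    }
    return -1;
}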

Check out this great answer:

https://stackoverflow.com/questions/23821764/why-do-we-use-linear-probing-in-hash-tables-when-there-is-separate-chaining-link

Quoting here:

I'm surprised that you saw chained hashing to be faster than linear probing - in practice, linear probing is typically significantly faster than chaining. In fact, that's the main reason it's used.

Although chained hashing is great in theory and linear probing has some known theoretical weaknesses (such as the need for five-way independence in the hash function to guarantee O(1) expected lookups), in practice linear probing is typically significantly faster due to locality of reference. Specifically, it's faster to access a series of elements in an array than it is to follow pointers in a linked list, so linear probing tends to outperform chained hashing even if it has to investigate more elements.

There are other wins for linear probing, too. For example, insertions into a linear probing hash table don't require any new allocations (unless you're rehashing the table), so in applications like network routers where memory is scarce, it's nice to know that once the table is set up, the elements can be placed into it with no risk of a malloc fail.
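
To make the quoted "no new allocations" point concrete, here is a rough sketch (mine, not the quoted author's) of inserting into a linear-probing table whose slot array was allocated once up front; it assumes non-negative int keys with -1 reserved to mark a free slot.

#include <stdbool.h>

#define FREE_SLOT (-1)   /* reserved value marking an unused slot */

struct FixedTable
{
    int* slots;      /* pre-allocated once, e.g. at router start-up */
    int  capacity;   /* never grown afterwards                      */
};

/* Insert by linear probing. There is no allocation on this path at
 * all; the only way to fail is for the table to be completely full. */
bool fixed_insert(struct FixedTable* t, int key)
{
    int home = key % t->capacity;                /* toy hash */
    for (int i = 0; i < t->capacity; i++)
    {
        int idx = (home + i) % t->capacity;
        if (t->slots[idx] == FREE_SLOT || t->slots[idx] == key)
        {
            t->slots[idx] = key;
            return true;
        }
    }
    return false;   /* table full -- but never a malloc failure */
}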


I'll jump in with a biased answer: I actually prefer separate chaining with singly-linked lists and find it easier to achieve good performance with them (I'm not saying they're optimal, just easier for my use cases), as contradictory as that sounds.

Of course the theoretical optimum is still a hash table with no collisions at all, or a probing technique with minimal clustering. The separate chaining solution, however, doesn't have to deal with clustering problems at all.

That said, the data representation I use does not require a separate memory allocation per node. Here it is in C:

struct Bucket
{
    int head;
};

struct BucketNode
{
    int next;
    int element;
};

struct HashTable
{
    // Array of buckets, pre-allocated in advance.
    struct Bucket* buckets;

    // Array of nodes, pre-allocated assuming the client knows
    // how many nodes they're going to insert in advance. Otherwise
    // realloc using a similar strategy to std::vector in C++.
    struct BucketNode* nodes;

    // Number of bucket heads.
    int num_buckets;

    // Number of nodes inserted so far.
    int num_nodes;
};

The buckets are just 32-bit indices (in practice I don't even use a struct) and the nodes are just two 32-bit indices. Often I don't even need the element index, because the nodes are stored in parallel with the array of elements to be inserted into the table, reducing the overhead of the hash table to 32 bits per bucket and 32 bits per element inserted. The real version I use more often looks like this:

struct HashTable
{
    // Array of head indices. The indices point to entries in the 
    // second array below.
    int* buckets;

    // Array of next indices parallel to the elements to insert.
    int* next_indices;

    // Number of bucket heads.
    int num_buckets;
};
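
To make the parallel-array idea concrete, here is a hedged sketch of insertion and lookup against this layout. The toy hash, the helper names, and the convention that -1 terminates a chain are my assumptions; the elements are taken to be non-negative ints, bucket heads are assumed initialized to -1, and next_indices is assumed to be at least as large as the element array.

/* Insert element i by pushing it onto the front of its bucket's chain. */
void ht_insert(struct HashTable* table, const int* elements, int i)
{
    int bucket = elements[i] % table->num_buckets;   /* toy hash */
    table->next_indices[i] = table->buckets[bucket];
    table->buckets[bucket] = i;
}

/* Walk one bucket's chain through next_indices; return the index into
 * 'elements' on a match, or -1 if the key is absent. */
int ht_find(const struct HashTable* table, const int* elements, int key)
{
    int bucket = key % table->num_buckets;
    for (int i = table->buckets[bucket]; i != -1; i = table->next_indices[i])
    {
        if (elements[i] == key)
            return i;
    }
    return -1;
}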

Also, if spatial locality degrades, I can easily perform a post-processing pass that constructs a new hash table in which each bucket's nodes are contiguous: a trivial copy function does a linear pass through the old table, bucket by bucket, and because of the order in which it traverses the table, the copy ends up with all of a bucket's neighboring nodes contiguous with each other.
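
A rough sketch of that repacking pass, under the same assumptions as the sketch above (the new table is pre-sized, its bucket heads start at -1, and new_elements has room for every element):

/* Walk the old table bucket by bucket and append elements to a new
 * element array in that order, so each bucket's nodes end up
 * contiguous in memory. Returns the number of elements copied. */
int ht_repack(const struct HashTable* old_table, const int* old_elements,
              struct HashTable* new_table, int* new_elements)
{
    int count = 0;
    for (int b = 0; b < old_table->num_buckets; b++)
    {
        for (int i = old_table->buckets[b]; i != -1; i = old_table->next_indices[i])
        {
            new_elements[count] = old_elements[i];

            /* Same push-front insert as before, into the new table. */
            int nb = new_elements[count] % new_table->num_buckets;
            new_table->next_indices[count] = new_table->buckets[nb];
            new_table->buckets[nb] = count;

            count++;
        }
    }
    return count;
}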

As for probing techniques, they come with the benefit that spatial locality is there from the beginning, without memory pools or a backing array like the one I use, and they don't have the 32-bit overhead per bucket and per node. But then you may have to deal with clustering problems, which can start to accumulate in a vicious way when there are lots of collisions.

I find the very nature of clustering to be a headache that requires a lot of analysis when there are many collisions. The benefit of this solution is that I can often achieve a decent result the first time around without such deep analysis and testing. Also, when the table resizes on its own implicitly, I've run into cases where such designs blew up memory usage in ways that far surpassed what this basic solution, at 32 bits per bucket and 32 bits per node, would run into even in the worst-case scenario. It's a solution that avoids becoming too bad even when there are a fair number of collisions.

Most of my codebase revolves around data structures that store indices, often in parallel with the array of elements to be inserted. That crunches down the memory footprint, avoids superfluous deep copies of the elements to insert, and makes it very easy to reason about memory usage. Aside from that, in my case I tend to benefit as much from predictable performance as from optimal performance. An algorithm that is optimal in many common-case scenarios but can perform horribly in worst-case scenarios is often less preferable to me than one that performs reasonably well all the time and doesn't cause frame rates to stutter at unpredictable times, so I tend to favor those kinds of solutions.
