3

I often see or hear of modulus being used as a last step of hashing or after hashing. e.g. h(input)%N where h is the hash function and % is the modulus operator. If I am designing a hash table, and want to map a large set of keys to a smaller space of indices for the hash table, doesn't the modulus operator achieve that? Furthermore, if I wanted to randomize the distribution across those locations within the hash table, is the remainder generated by modulus not sufficient? What does the hashing function h provide on top of the modulus operator?

2 Answers 2

3

I often see or hear of modulus being used as a last step of hashing or after hashing. e.g. h( input ) % N where h is the hash function and % is the modulus operator.

Indeed.

If I am designing a hash table, and want to map a large set of keys to a smaller space of indices for the hash table, doesn't the modulus operator achieve that?

That's precisely the purpose of the modulo operator: to restrict the range of array indexes, so yes.

But you cannot simply use the modulo operator by itself: the modulo operator requires an integer value: you cannot get the "modulo of a string over N" or "modulo of an object-graph over N"[1].

Furthermore, if I wanted to randomize the distribution across those locations within the hash table, is the remainder generated by modulus not sufficient?

No, it does not - because the modulo operator doesn't give you pseudorandom output - nor does it have any kind of avalanche effect - which means that similar input values will have similar output hashes, which will result in clustering in your hashtable bins, which will result in subpar performance due to the greatly increased likelihood of hash-collisions (and so requiring slower techniques like linear-probing which defeat the purpose of a hashtable because you lose O(1) lookup times.

What does the hashing function h provide on top of the modulus operator?

The domain of h can be anything, especially non-integer values.


[1] Technically speaking, this is possible if you use the value of the memory address of an object (i.e. an object pointer), but that doesn't work if you have hashtable keys that don't use object identity, such as a stack-allocated object or custom struct.

2
  • "No, it does not - because the modulo operator doesn't give you pseudorandom output". If we had integers as inputs (and perform modulus operator on them), would the output be not be evenly distributed so long as the input doesn't on the rare case overlap with multiple of N in number % N? I definitely see your point about pseudorandom being helpful to prevent an unpredictable stream of inputs with possible patterns from causing collisions, however. Commented Feb 5, 2021 at 19:23
  • 1
    Correct - in the case where the input key type is an integer then a hash-function function may be redundant, however you need to ensure you have an optimal number of bins for the domain of your key values, otherwise you'll have unused bins (e.g. if you're restricted to ASCII char values [A-Z] then you don't need more than 26 bins - and in fact, you may as well skip using a hashtable and use a directly indexable fixed-size array instead.
    – Dai
    Commented Feb 5, 2021 at 20:28
2

First, the hash function's primary purpose is to turn something that's not a number into a number. Even if you just use modulus after that to get a number in your range, getting the number is still the first step and is the responsibility of the hash function. If you're hashing integers and you just use the integers as their own hashes, it isn't that there's no hash function, it's that you've chosen the identity function as your hash function. If you don't write out the function, that means you inlined it.

Second, the hash function can provide a more unpredictable distribution to reduce the likelihood of unintentional collisions. The data people work with often contain patterns and if you're just using a simple identity function with modulus, the pattern in inputs may be such that the modulus is more likely to cause collisions. The hash function presents an opportunity to break this up so it becomes unlikely that modulus exposes patterns in the original data sequence.

Not the answer you're looking for? Browse other questions tagged or ask your own question.