7
  • What are they and how do they work?
  • Where are they used?
  • When should I (not) use them?

I've heard the word over and over again, yet I don't know its exact meaning.

What I heard is that they allow associative arrays by sending the array key through a hash function that converts it into an int and then uses a regular array. Am I right with that?

(Notice: This is not my homework; I go too school but they teach us only the BASICs in informatics)

1
  • I suspect they won't tell you very much about hash tables. They're actually quite hard to analyze (you need a probabilistic normal-case argument, not a worst-case argument) and depend on a lot of magic. Most teachers of Algorithms classes prefer to use more theoretically beautiful constructs like balanced trees, as they show off recursive arguments. Practical code uses hash tables though; it just works so well. Commented Apr 12, 2010 at 21:32

3 Answers 3

6

Wikipedia seems to have a pretty nice answer to what they are.

You should use them when you want to look up values by some index.

As for when you shouldn't use them... when you don't want to look up values by some index (for example, if all you want to ever do is iterate over them.)

7
  • How in gods name can there be a hash function that always outputs the right integers like 0 1 2 3 when receiving "abc" "myCatIsFat" or "101010" as input?!
    – keg
    Commented Apr 12, 2010 at 20:32
  • 1
    @keg Hash functions usually return a more or less random-looking value, not sequential integers. Why are you asking? Commented Apr 12, 2010 at 20:37
  • @key That would be the perfect hash function as described in the wiki article. Read it to get why this function is hard to find and how to make the things at least semi-optimal.
    – PeterMmm
    Commented Apr 12, 2010 at 20:40
  • It's very very hard to write a good hash function. It's very easy to write a bad one, and the literature is not great (it's much more extensive on cryptographic hash functions). The input language matters a lot too, and some hash functions are theoretically great but slow in practice. Commented Apr 12, 2010 at 20:55
  • @keg as an example, a hash function that, say, just returned the length of the string representation of the input would return integers for arbitrary input. Not a very good hash function mind you, see @PeterMmm comment
    – brabster
    Commented Apr 12, 2010 at 20:55
3

You've about got it. They're a very good way of mapping from arbitrary things (keys) to arbitrary things (values). The idea is that you apply a function (a hash function) that translates the key to an index into the array where you store the values; the hash function's speed is typically linear in the size of the key, which is great when key sizes are much smaller than the number of entries (i.e., the typical case).

The tricky bit is that hash functions are usually imperfect. (Perfect hash functions exist, but tend to be very specific to particular applications and particular datasets; they're hardly ever worthwhile.) There are two approaches to dealing with this, and each requires storing the key with the value: one (open addressing) is to use a pre-determined pattern to look onward from the location in the array with the hash for somewhere that is free, the other (chaining) is to store a linked list hanging off each entry in the array (so you do a linear lookup over what is hopefully a short list). The cases of production code where I've read the source code have all used chaining with dynamic rebuilding of the hash table when the load factor is excessive.

0
1

Good hash functions are one way functions that allow you to create a distributed value from any given input. Therefore, you will get somewhat unique values for each input value. They are also repeatable, such that any input will always generate the same output.

An example of a good hash function is SHA1 or SHA256.

Let's say that you have a database table of users. The columns are id, last_name, first_name, telephone_number, and address.

While any of these columns could have duplicates, let's assume that no rows are exactly the same.

In this case, id is simply a unique primary key of our making (a surrogate key). The id field doesn't actually contain any user data because we couldn't find a natural key that was unique for users, but we use the id field for building foreign key relationships with other tables.

We could look up the user record like this from our database:

SELECT * FROM users
WHERE last_name = 'Adams'
AND first_name = 'Marcus'
AND address = '1234 Main St'
AND telephone_number = '555-1212';

We have to search through 4 different columns, using 4 different indexes, to find my record.

However, you could create a new "hash" column, and store the hash value of all four columns combined.

String myHash = myHashFunction("Marcus" + "Adams" + "1234 Main St" + "555-1212");

You might get a hash value like AE32ABC31234CAD984EA8.

You store this hash value as a column in the database and index on that. You now only have to search one index.

SELECT * FROM users
WHERE hash_value = 'AE32ABC31234CAD984EA8';

Once we have the id for the requested user, we can use that value to look up related data in other tables.

The idea is that the hash function offloads work from the database server.

Collisions are not likely. If two users have the same hash, it's most likely that they have duplicate data.

2
  • It's 7 years later, but I came to post my own question and found this instead. I don't know why you don't have any upvotes (mine's the 1st). I thought yours was the clearest and most understandable for those who know little or nothing about the subject. BUT: If I know the pk, I can look up all that information, so I still see no advantage to a hash, which I would also have to look up - especially if I'm risking collisions I won't get with a pk. I don't understand how this offloads the db server. If the hash and the values are all in the db, aren't you still hitting the db anyway? Commented Feb 28, 2017 at 22:57
  • I wrote that answer a long time ago, way before I actually researched collisions. Frankly put, you're not going to have a collision. You don't have to worry about them. While collisions are theoretically possible, no one's ever found one (for SHA1 or SHA256), and I suspect that if one were found, it wouldn't be between two strings of human readable text. It offloads work from the DB because the DB server doesn't have to do as much searching. It's searching one index instead of many. Commented Mar 2, 2017 at 15:17

Not the answer you're looking for? Browse other questions tagged or ask your own question.