69

I was messing around with making a command line parser and was wondering what kind of hash algorithm python dict's use?

The way I have it set up, I have a pattern match algorithm which matches tokenized input sequences with a dictionary key. Some of the keys are relatively long (length 5 or 6 tuples of 6-7 character strings). I was wondering if there was a point at which long dictionary keys significantly reduce the efficiency of key retrieval.

3
  • 1
    Take a look at this question. It has a link to this page which describes how python hashes some different types and it might well be helpful to you.
    – srgerg
    Commented Jan 25, 2012 at 4:58
  • 1
    Take a look at Objects/dictnotes.txt
    – jfs
    Commented Jan 25, 2012 at 5:13
  • 2
    A lot of the modern hashing code was modified for PEP 456 which is documented here: python.org/dev/peps/pep-0456. The answer is: multiple hash functions can be used depending on compilation arguments and string size. Commented Jan 23, 2019 at 22:22

1 Answer 1

74

The hash that it uses depends on the object being used as a key -- each class can define its own __hash__() method, and the value that it returns for a particular instance is what is used for the dictionary.

Python itself provides the hash implementation for str and tuple types. A quick look at the source should reveal the exact algorithm for those.

The hash of a tuple is based on the hashes of its contents. The algorithm is essentially this (simplified slightly):

def hash(tuple):
    mult = 1000003
    x = 0x345678
    for index, item in enumerate(tuple):
        x = ((x ^ hash(item)) * mult) & (1<<32)
        mult += (82520 + (len(tuple)-index)*2)
    return x + 97531

For strings, the interpreter also iterates over every character, combining them with this (again, slightly simplified) algorithm:

def hash(string):
    x = string[0] << 7
    for chr in string[1:]:
        x = ((1000003 * x) ^ chr) & (1<<32)
    return x

A bigger issue to worry about is avoiding hash collisions. Colliding hash keys will cause a linear search as the dictionary tries to find a place to store the new object (this is now being recognized as a security issue, and tha behavior may be changing in upcoming python versions)

9
  • 2
    Oh ok. For some reason I assumed python just used a generic bytecode hash algorithm for all data types. As far as colliding hash keys goes, I don't think that's going to be an issue, as the number of keys I will have is (relatively) small -- probably in the thousands. Pardon my ingnorance, but how do colliding hashes and linear searches become a security issue? Commented Jan 25, 2012 at 5:06
  • 7
    @Joel Cornett: This is a security issue because hash tables use buckets to store keys, and keys with the same hash code will be hashed to the same bucket, forcing the hash table to do a linear search each time it searches for a key, which can be very inefficient (and can even cause denial of service) if the number of keys is large. Denial-of-service attacks can result if a program encounters a hash table with distinct keys that hash to the same hash code.
    – Peter O.
    Commented Jan 25, 2012 at 5:17
  • 7
    If an attacker can control the keys that are used in your dictionary, then they might be able to insert hundreds or thousands of colliding keys, making insert operations very very slow. In some cases, this could cause a machine to become unresponsive, or a database to become unusable -- a Dos attack Commented Jan 25, 2012 at 5:24
  • 1
    Are there documented cases of such attacks being launched? It seems like it would be prohibitively difficult against all but the simplest system, and even harder without white-box knowledge. Commented Jan 25, 2012 at 8:16
  • 2
    The first attack scenario I heard being discussed was against web services -- if the POST variables end up in a dictionary keyed on their names, then an attacker has complete control over the keys, and can slow down a server arbitrarily, with enough variables. I don't know if there are such attacks in the wild yet, though. Commented Jan 25, 2012 at 19:45

Not the answer you're looking for? Browse other questions tagged or ask your own question.