17

Let's say we have the following Python class (the problem exists in Java just the same with equals and hashCode)

class Temperature:
    def __init__(self, degrees):
        self.degrees = degrees

where degrees is the temperature in Kelvin as a float. Now, I would like to implement equality testing and hashing for Temperature in a way that

  • compares floats up to an epsilon difference instead of direct equality testing,
  • and honors the contract that a == b implies hash(a) == hash(b).
def __eq__(self, other):
    return abs(self.degrees - other.degrees) < EPSILON

def __hash__(self):
    return # What goes here?

The Python documentation talks a bit about hashing numbers to ensure that hash(2) == hash(2.0) but this is not quite the same problem.

Am I even on the right track? And if so, what is the standard way to implement hashing in this situation?

Update: Now I understand that this type of equality testing for floats eliminates the transitivity of == and equals. But how does that go together with the "common knowledge" that floats should not be compared directly? If you implement an equality operator by comparing floats, static analysis tools will complain. Are they right to do so?

7
  • 10
    why is the question has Java's tag?
    – Laiv
    Commented Apr 29, 2019 at 7:53
  • 8
    About your update: I would say that hashing floats is generally a questionable thing. Try to avoid using floats as keys or as set elements. Commented Apr 29, 2019 at 9:04
  • 7
    @Neil: At the same time, doesn't rounding sound like integers? By that I mean: if you can round to, say, thousandths of degrees, then you could simply used a fixed-point representation -- an integer expressing the temperature in thousandths of degrees. For ease of use, you could have a getter/setter transparently converting from/to floats if you wish to... Commented Apr 29, 2019 at 11:12
  • 4
    Kelvins are no longer degrees. Degrees are also ambiguous. Why not just call it kelvin? Commented Apr 29, 2019 at 12:01
  • 5
    Python has more-or-less excellent fixed-point support, maybe that’s something for you. Commented Apr 29, 2019 at 14:07

6 Answers 6

41

implement equality testing and hashing for Temperature in a way that compares floats up to an epsilon difference instead of direct equality testing,

Fuzzy equality violates the requirements that Java places on the equals method, namely transitivity, i.e. that if x == y and y == z, then x == z. But if you do an fuzzy equality with, for example, an epsilon of 0.1, then 0.1 == 0.2 and 0.2 == 0.3, but 0.1 == 0.3 does not hold.

While Python does not document such a requirement, still the implications of having a non-transitive equality make it a very bad idea; reasoning about such types is headache-inducing.

So I strongly recommend you don't do that.

Either provide exact equality and base your hash on that in the obvious way, and provide a separate method to do the fuzzy matching, or go with the equivalence class approach suggested by Kain. Though in the latter case, I recommend you fix your value to a representative member of the equivalence class in the constructor, and then go with simple exact equality and hashing for the rest; it's much easier to reason about the types this way.

(But if you do that, you might as well use a fixed point representation instead of floating point, i.e. you use an integer to count thousandths of a degree, or whatever precision you require.)

5
  • 2
    interesting thoughts. So by accumulating millions of epsilon and with transitivity you can conclude that anything is equal to anything else :-) But does this mathematic constraint acknowledge the discrete foundation of floating points, which in many cases are approximations of the number they are intended to represent ?
    – Christophe
    Commented Apr 29, 2019 at 6:51
  • @Christophe Interesting question. If you think about it, you'll see that this approach will make a single large equivalence class out of floats whose resolution is greater than epsilon (it's centered on 0, of course) and leave the other floats in their own class each. But that's not the point, the real problem is that whether it concludes that 2 numbers are equal depends on whether there is a third one compared and the order in which that is done.
    – Ordous
    Commented Apr 29, 2019 at 14:50
  • 1
    Addressing @OP's edit, I would add that the incorrectness of floating-point == should "infect" the == of types containing them. That is, if they follow your advice of providing an exact equality, then their static analysis tool should further be configured to warn when equality is used on Temperature. It's the only thing you can do, really.
    – HTNW
    Commented Apr 29, 2019 at 16:56
  • 1
    @HTNW: That would be too simple. A ratio class might have a float approximation field which does not participate in ==. Besides, the static analysis tool will already give a warning inside the == implementation of classes when one of the members being compared is a float type.
    – MSalters
    Commented Apr 30, 2019 at 9:58
  • @MSalters ? Presumably, sufficiently configurable static analysis tools can do what I suggested just fine. If a class has a float field that doesn't participate in ==, then don't configure your tool to warn on == on that class. If the class does, then presumably marking the class's == as "too exact" will cause the tool to ignore that sort of error within the implementation. E.g. in Java, if @Deprecated void foo(), then void bar() { foo(); } is a warning, but @Deprecated void bar() { foo(); } is not. Maybe many tools don't support this, but some might.
    – HTNW
    Commented Apr 30, 2019 at 12:24
17

Good Luck

You are not going to be able to achieve that, without being stupid with hashes, or sacrificing the epsilon.

Example:

Assume that each point hashes to its own unique hash value.

As floating point numbers are sequential there will be up to k numbers prior to a given floating point value, and up to k numbers after a given floating point value which are within some epsilon of the given point.

  1. For each two points within epsilon of each other that do not share the same hash value.

    • Adjust the hashing scheme so that these two points hash to the same value.
  2. Inducting for all such pairs the entire sequence of floating point numbers will collapse toward a single has value.

There are a few cases where this will not hold true:

  • Positive/Negative Infinity
  • NaN
  • A few De-normalised ranges that may not be linkable to the main range for a given epsilon.
  • perhaps a few other format specific instances

However >=99% of the floating point range will hash to a single value for any value of epsilon that includes at least one floating point value above or below some given floating point value.

Outcome

Either >= 99% entire floating point range hashes to a single value seriously comprimising the intent of a hash value (and any device/container relying on a fairly distributed low-collision hash).

Or the epsilon is such that only exact matches are permitted.

Granular

You could of course go for a granular approach instead.

Under this approach you define exact buckets down to a particular resolution. ie:

[0.001, 0.002)
[0.002, 0.003)
[0.003, 0.004)
...
[122.999, 123.000)
...

Each bucket has a unique hash, and any floating point within the bucket compares equal to any other float in the same bucket.

Unfortunately it is still possible for two floats to be epsilon distance away, and have two separate hashes.

5
  • 2
    I agree that the granular approach here would probably be best, if that fits OP's requirements. Though I'm afraid OP has like +/- 0.1% type requirements, meaning it can't be granular.
    – Neil
    Commented Apr 29, 2019 at 6:35
  • 4
    @DocBrown The "not possible" part is correct. If epsilon based equality should imply that the hash codes are equal, then you automatically have all hash codes equal, so the hash function is not useful anymore. The buckets approach can be fruitful, but you will have numbers with different hash codes that are arbitrarily close to each other. Commented Apr 29, 2019 at 8:59
  • 2
    The bucket approach can be modified by checking not only the bucket with the exact hash key, but also the two neighboured buckets (or at least one of them) for their content as well. That elimininates the problem of those edge cases for the cost of increasing the running time by a factor of at most two (when implemented correctly). However, it does not change the general running time order.
    – Doc Brown
    Commented Apr 29, 2019 at 15:26
  • While you are right in spirit, not everything will collapse. With a fixed small epsilon, most numbers will only equal themselves. Of course, for those the epsilon will be useless, so again, in spirit you are correct.
    – Carsten S
    Commented Apr 30, 2019 at 9:43
  • 1
    @CarstenS Yes, my statement that 99% of the range hashes to a single hash does not actually cover the whole float range. There are many high range values who are separated by more than epsilon that will hash to their own unique buckets.
    – Kain0_0
    Commented Apr 30, 2019 at 23:50
8

You can model your temperature as an integer under the hood. Temperature has a natural lower bound (-273.15 Celsius). So, double (-273.15 is equal to 0 for your underlying integer). The second element that you need is the granularity of your mapping. You are already using this granularity implicitly; it is your EPSILON.

Just divide your temperature by EPSILON and take the floor of it, now your hash and your equal will behave in sync. In Python 3 the integer is unbounded, EPSILON can be smaller if you like.

BEWARE If you change the value of EPSILON and you have serialised the object they will be not compatible!

#Pseudo code
class Temperature:
    def __init__(self, degrees):
        #CHECK INVALID VALUES HERE
        #TRANSFORM TO KELVIN HERE
        self.degrees = Math.floor(kelvin/EPSILON)
1

Implementing a floating-point hash table that can find things that are "approximately equal" to a given key will require using a couple of approaches or a combination thereof:

  1. Round each value to an increment which is somewhat larger than the "fuzzy" range before storing it in the hash table, and when trying to find a value, check the hash table for the rounded values above and below the value sought.

  2. Store each item within the hash table using keys that are above and below the value being sought.

Note that using either approach will likely require that hash table entries not identify items, but rather lists, since there will likely be multiple items associated with each key. The first approach above will minimize the required hash table size, but each search for an item not in the table will require two hash-table lookups. The second approach will quickly be able to identify that items aren't in the table, but will generally require the table to hold about twice as many entries as would otherwise be required. If one is trying to find objects in 2D space, it may be useful to use one approach for the X direction and one for the Y direction, so that instead of having each item stored once but requiring four query operations for each lookup, or being able to use one lookup to find an item but having to store each item four times, one would store each item twice and use two lookup operations to find it.

0

You can of course define “almost equal” by deleting say the last eight bits of the mantissa and then comparing or hashing. The problem is that numbers very close to each other may be different.

There is some confusion here: if two floating point numbers compare equal, they are equal. To check if they are equal, you use “==“. Sometimes you don’t want to check for equality, but when you do, “==“ is the way to go.

0

This isn't an answer, but an extended comment that may be helpful.

I have been working on a similar problem, while using MPFR (based on GNU MP). The "bucket" approach as outlined by @Kain0_0 seems to give acceptable results, but be aware of the limitations highlighted in that answer.

I wanted to add that -- depending on what you are trying to do -- using an "exact" (caveat emptor) computer algebra system like Mathematica may help supplement or verify an inexact numerical program. This will allow you to compute results without worrying about rounding, for example, 7*√2 - 5*√2 will yield 2 instead of 2.00000001 or similar. Of course, this will introduce additional complications that may or may not be worth it.

Not the answer you're looking for? Browse other questions tagged or ask your own question.