I don't know if the title of the question is accurate, but I was thinking about the following problem:
- I have a set of numbers; let's call them identifiers. Each identifier is stored in a 64-bit integer, but in practice I never have more than, say, 4 million different identifiers at any time.
- I want to store some information for each existing identifier, for example in a dictionary. If I have exactly 4M different numbers of 8 bytes each (worst case), the dictionary needs around 32 MiB of space just for the keys.
- However, only 22 bits are needed to distinguish up to 4M different identifiers. If I had a way to map each unique 64-bit key to a unique 22-bit key, then in this worst case of 4M keys those 32 MiB worth of keys would shrink to 11 MiB, a reduction of about 65%.
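The arithmetic above can be double-checked quickly, taking "4M" to mean exactly $2^{22}$:

```python
# Back-of-envelope check of the key-storage figures above, with "4M" = 2**22.
n_keys = 2**22

bytes_64 = n_keys * 64 // 8     # keys stored as 64-bit integers
bytes_22 = n_keys * 22 / 8      # keys stored as tightly packed 22-bit integers

print(bytes_64 / 2**20)         # 32.0 (MiB)
print(bytes_22 / 2**20)         # 11.0 (MiB)
print(1 - bytes_22 / bytes_64)  # 0.65625, i.e. a ~65% reduction
```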
What is the most space-efficient way to implement that 64-bit to 22-bit number mapping?
I was thinking that, assuming the existing keys are evenly distributed over the 64-bit range, between each pair of consecutive keys there would be a gap of width $2^{42}$, which is $2^{20}$ times bigger than the maximum number of keys... a lot. In that extreme (perfectly even) case, jumping from one id to the next means adding $2^{42}$; in other words, the low 42 bits of all ids would be identical, so my mapping would consist of just trimming the low 42 bits of each input id.
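For that idealized, perfectly even case, the mapping and its inverse are just shifts. A minimal sketch, assuming every id has the form $k \cdot 2^{42}$ with $k < 2^{22}$:

```python
# Sketch of the "trim the low 42 bits" mapping for perfectly even spacing.
# Assumes every id is exactly k * 2**42 for some k in [0, 2**22), which is
# the idealized case described above.
GAP_BITS = 42

def to_22bit(id64: int) -> int:
    return id64 >> GAP_BITS   # keep only the high 22 bits

def to_64bit(id22: int) -> int:
    return id22 << GAP_BITS   # exact inverse for ids of the assumed form

ids = [k << GAP_BITS for k in (0, 1, 12345, 2**22 - 1)]
assert [to_22bit(i) for i in ids] == [0, 1, 12345, 2**22 - 1]
assert all(to_64bit(to_22bit(i)) == i for i in ids)
```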
If the keys are anything but evenly distributed, they must form clusters instead; you could then identify those clusters, find their centers, assign a 0-based index to each cluster, and translate each input identifier into the index of the cluster it belongs to plus its offset within that cluster. If the number of bits for the cluster index plus the number of bits for the offset is $\leq$ 22, you have achieved the goal.
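The id-to-cluster step could look like the sketch below. The centers and the nearest-center rule are my own illustrative assumptions, not part of any fixed scheme:

```python
# Sketch: map an id to (cluster index, offset from that cluster's center),
# picking the nearest center by binary search. The centers are made up for
# illustration and must be kept sorted.
import bisect

centers = [10_000, 5_000_000, 9_000_000_000]

def locate(id64: int):
    i = bisect.bisect(centers, id64)
    if i == 0:
        c = 0
    elif i == len(centers):
        c = len(centers) - 1
    else:
        # choose whichever neighbouring center is closer
        c = i - 1 if id64 - centers[i - 1] <= centers[i] - id64 else i
    # note: the offset is negative for ids to the left of the center
    return c, id64 - centers[c]

assert locate(10_050) == (0, 50)
assert locate(4_999_000) == (1, -1_000)
```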
For example, you can sort the clusters by size (largest first) and give each one a prefix-free code: prefix 0 with a 21-bit offset for cluster 0, prefix 10 with a 20-bit offset for cluster 1, prefix 110 with a 19-bit offset for cluster 2, and so on. (The prefixes must form a prefix-free code, otherwise the cluster index cannot be recovered unambiguously from the 22-bit value.) This means you need extra storage to maintain the cluster centers, so that you can tell which cluster each potential input id belongs to.
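One prefix-free variant of this idea uses a unary cluster prefix (0, 10, 110, ...), so cluster $i$ keeps $22 - (i+1)$ offset bits. A sketch, leaving out the cluster-assignment step entirely:

```python
# Sketch: pack (cluster index, offset) into one 22-bit code using a unary
# prefix (i ones followed by a 0), so earlier/larger clusters keep more
# offset bits. Purely illustrative, not a complete scheme.
TOTAL_BITS = 22

def encode(cluster_idx: int, offset: int) -> int:
    prefix_len = cluster_idx + 1
    offset_bits = TOTAL_BITS - prefix_len
    assert 0 <= offset < (1 << offset_bits), "offset too large for this cluster"
    prefix = ((1 << cluster_idx) - 1) << 1   # e.g. idx 2 -> 0b110
    return (prefix << offset_bits) | offset

def decode(code: int):
    # count leading ones until the terminating 0, then read the offset
    idx = 0
    while (code >> (TOTAL_BITS - 1 - idx)) & 1:
        idx += 1
    offset_bits = TOTAL_BITS - (idx + 1)
    return idx, code & ((1 << offset_bits) - 1)

assert decode(encode(0, 12345)) == (0, 12345)  # cluster 0: 21-bit offset
assert decode(encode(3, 777)) == (3, 777)      # cluster 3: 18-bit offset
```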
But I don't know how to combine all of those ideas into a concrete mapping strategy, nor how to generalize it to the general problem of mapping up to $2^k$ identifiers of $n$ bits using the fewest bits possible.