$\begingroup$

I don't know if the title of the question is accurate, but I was thinking about the following problem:

  • I have a set of numbers, let's call them identifiers. Each identifier is stored in a 64-bit integer, but in practice I never have more than, let's say, 4 million different identifiers at any time.
  • I want to store some information for each existing identifier, for example in a dictionary. If I have exactly 4M different numbers of 8 B each (the worst case), storing the dictionary would require around 32 MiB just for the keys.
  • However, only 22 bits are needed to distinguish up to 4M different identifiers. If I had a way to map each unique 64-bit key to a unique 22-bit key, then in this worst case of 4M keys, those 32 MiB of keys would shrink to 11 MiB; a 65% reduction.

What is the most space-efficient way to implement that 64-bit to 22-bit number mapping?

I was thinking that, assuming the existing keys are evenly distributed over the 64-bit range, between each pair of consecutive keys there would be a gap of width $2^{42}$, which is $2^{20}$ times bigger than the maximum number of keys... a lot. In that extreme case (perfectly even distribution), jumping from one id to the next means adding $2^{42}$; in other words, the last 42 bits of all ids would be identical, which means that my mapping would consist of just trimming the last 42 bits of each input id.
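That extreme case can be sketched in a few lines (a toy illustration, assuming the ids really are spaced exactly $2^{42}$ apart):

```python
# Toy illustration of the evenly-spaced extreme: if consecutive ids differ
# by exactly 2**42, the low 42 bits carry no information, and the 64-bit
# to 22-bit mapping is just a right shift.
ids = [i << 42 for i in range(8)]        # hypothetical evenly spaced ids
small_keys = [x >> 42 for x in ids]      # drop the constant low 42 bits
assert small_keys == list(range(8))      # distinct, compact small keys
```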

If the keys are anything but evenly distributed, they must form clusters instead, so you could identify those clusters, identify their centers, assign a 0-based index to each cluster, and then translate each input identifier into the index of the cluster it belongs to plus its offset within that cluster. If the number of bits for the cluster index plus the number of bits for the offset is $\leq 22$, you have achieved the goal.

For example, you could sort the clusters by size and assign 22 bits of offset to cluster 0, 21 bits to cluster 1, 20 bits to clusters 2 and 3 (10 and 11), 19 bits to clusters 4, 5, 6, and 7 (100, 101, 110, 111), etc. That means you would need extra storage to maintain the cluster centers and to be able to tell which cluster each potential input id belongs to.

But I don't know how to combine all of those ideas into a mapping strategy, or how to generalize it to the general problem of mapping up to $2^k$ identifiers of $n$ bits using the fewest bits possible.
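As a point of reference for any such strategy, note that a dense 22-bit key always exists if you keep the sorted key set around: the rank of a key in sorted order. A minimal sketch using Python's standard `bisect` module (toy ids of my own invention):

```python
import bisect

# Baseline (non-compressed) mapping: store the keys sorted, and map each
# 64-bit key to its rank, which fits in 22 bits when there are <= 2**22 keys.
ids = sorted([7 << 42, 3 << 42, 1 << 50, 12345])  # toy 64-bit identifiers

def to_small_key(x):
    """Return the dense small key (rank) of a known 64-bit identifier."""
    i = bisect.bisect_left(ids, x)
    assert i < len(ids) and ids[i] == x, "unknown identifier"
    return i

assert [to_small_key(x) for x in ids] == [0, 1, 2, 3]
```

Of course, this still stores the full 64-bit keys, so it saves nothing; the question is how far below that one can compress the key set itself.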

$\endgroup$

2 Answers

1
$\begingroup$

Short answer: It's most likely not worth worrying about. The maximum possible amount of space saving is small enough that it's not worth the additional code complexity.

One can prove that any data structure to store the mapping needs at least $\lg {n \choose k}$ bits, where here $n=2^{64}$ and $k=2^{22}$. This evaluates to about 22.8MB. So the most you can hope to reduce space usage is from 32MB to 22.8MB. Probably not worth the bother, from any engineering perspective, as it won't fit in L2 cache either way, and it will easily fit in RAM on any reasonable machine. I suspect that the engineering cost of dealing with complex data structures, and the increased time to compute the mapping, will probably outweigh any space savings in most cases.
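For concreteness, the bound can be evaluated numerically. Since $k \ll n$ here, $\lg {n \choose k} = \sum_{i<k} \lg(n-i) - \lg(k!) \approx 64k - \lg(k!)$ to well beyond double precision; a quick sketch:

```python
import math

n_bits, k = 64, 2**22   # universe {0,1}^64, sets of 2**22 elements

# lg C(n, k) = sum_{i<k} lg(n - i) - lg(k!).  With n = 2**64 and k = 2**22,
# each lg(n - i) is indistinguishable from 64 in double precision, so:
log2_k_factorial = math.lgamma(k + 1) / math.log(2)
lower_bound_bits = n_bits * k - log2_k_factorial

print(lower_bound_bits / 8 / 1e6)   # ~22.8 (MB)
```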


The long answer: Your question states you care only about the space efficiency of the data structure to store the set of 4M identifiers, and there are no requirements on how long it takes to compute the mapping.

If that's all that matters, the most space efficient way to store the set of 4M identifiers is by using a ranking/unranking scheme for all 4M-element subsets of $\{0,1\}^{64}$, also known as a ranking scheme for combinations. See, e.g., https://computationalcombinatorics.wordpress.com/2012/09/10/ranking-and-unranking-of-combinations-and-permutations/ for one such scheme. This scheme is optimal in space usage: it represents a $k$-subset of a space of size $n$ using $\lg {n \choose k}$ bits. Here $k=2^{22}$ and $n=2^{64}$. So that is a complete solution to your problem.
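One concrete ranking scheme (the "combinadic", i.e. the combinatorial number system; not necessarily the exact scheme in the linked post) assigns each sorted $k$-subset its colexicographic rank $\sum_i {c_i \choose i+1}$. A minimal sketch:

```python
from itertools import combinations
from math import comb

def rank(subset):
    """Colexicographic rank of a sorted k-subset: sum of C(c_i, i+1)."""
    return sum(comb(c, i + 1) for i, c in enumerate(subset))

def unrank(r, k):
    """Inverse of rank: greedily recover the largest element first."""
    out = []
    for i in range(k, 0, -1):
        c = i - 1
        while comb(c + 1, i) <= r:   # largest c with C(c, i) <= r
            c += 1
        out.append(c)
        r -= comb(c, i)
    return tuple(sorted(out))

# Round-trip over every 3-subset of a 6-element universe:
for s in combinations(range(6), 3):
    assert unrank(rank(s), 3) == s
```

Ranks cover exactly $0$ to ${n \choose k}-1$, so $\lceil \lg {n \choose k} \rceil$ bits suffice; but note that `unrank` as written is far too slow for $n = 2^{64}$, which is precisely the practical objection below.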

In practice, such schemes are normally unusable, because while space usage is optimal, the time to compute the mapping is too large. So in a practical setting, we probably care more about a good tradeoff between the space required to store the data structure, and the time to compute the mapping.

From a theoretical perspective, you might be interested in succinct data structures. See, e.g., https://en.wikipedia.org/wiki/Succinct_data_structure and https://cstheory.stackexchange.com/q/5010/5038.

$\endgroup$
  • $\begingroup$ Could you add a little explanation of why you need at least $\lg{n\choose k}$ bits? I have tried to figure that out by checking your links, but I haven't gotten it. $\endgroup$
    – ABu
    Commented Jan 3 at 21:11
  • $\begingroup$ @ABu, there are ${n \choose k}$ ways to choose a set containing $k$ values out of a universe of $n$ possible values. If there are $N$ possibilities for some value, then it takes at least $\lg N$ bits to represent it. So it takes at least $\lg {n \choose k}$ bits to represent a set of $k$ values selected from a universe of $n$ possible values. $\endgroup$
    – D.W.
    Commented Jan 3 at 22:11
  • $\begingroup$ Ok, I can see that that's the minimal cost of storing the index of a particular combination among the whole set of possibilities, but that's not necessarily the cost of storing a particular combination only, right? $\endgroup$
    – ABu
    Commented Jan 4 at 14:03
  • $\begingroup$ @ABu, I'm not sure I understand what you mean by a "combination" or "index" of a combination, but the answer applies to both - it applies to any representation or encoding of a combination, whether as an index or in any other way. $\endgroup$
    – D.W.
    Commented Jan 4 at 19:46
0
$\begingroup$

If you have these four 8-bit keys, they require 4×8 = 32 bits of storage

11110001
11110010
11010001
11010010

But you can separate them into buckets by their 4-bit prefixes

bucket: 1111
bucket: 1101

Now you only need to store the suffixes 0001 and 0010 in each bucket, and you have reduced the storage to half.

You can do it recursively, and reduce the storage space logarithmically.

The data structure is a type of trie; you probably need a radix tree. Surely there is already a trie variant optimized for your problem.
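The bucketing step above can be sketched directly (a toy version using the 8-bit keys from the example, split on the high nibble):

```python
from collections import defaultdict

keys = [0b11110001, 0b11110010, 0b11010001, 0b11010010]

# Group by the high 4 bits; store only the low 4 bits inside each bucket.
buckets = defaultdict(list)
for k in keys:
    buckets[k >> 4].append(k & 0xF)

assert sorted(buckets) == [0b1101, 0b1111]
assert buckets[0b1111] == [0b0001, 0b0010]

# The original keys are recoverable, so no information is lost:
restored = sorted((hi << 4) | lo for hi, lows in buckets.items() for lo in lows)
assert restored == sorted(keys)
```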

$\endgroup$
  • $\begingroup$ Yes, that's the kind of strategy I'm trying now, but instead of exploiting common prefixes, I'm trying to exploit any subsequence of bits, in the sense that each bucket is identified by a mask and a witness. The mask identifies the bits that all members of the bucket have in common, and the witness is any member thereof, so that n belongs to the bucket iff n & mask == witness & mask; but I'm still trying to understand the properties and the feasibility of this solution. $\endgroup$
    – ABu
    Commented Dec 29, 2023 at 23:29
  • $\begingroup$ Please be clearer about what you mean by reduce the storage space "logarithmically". I don't believe that is possible in this case. For your example, I don't think you've counted the cost of storing 1111 and 1101, nor of identifying which bucket each value falls into, nor described what to do if the values don't fall into 2 buckets each containing 2 items. Therefore, I believe it takes more bits to store these values than your answer would suggest. $\endgroup$
    – D.W.
    Commented Jan 3 at 22:12
  • $\begingroup$ @D.W. Write all n consecutive numbers in binary. Half of them start with 1, 1/4 start with 10, 1/8 start with 100, so you get a tree structure with height log(n). $\endgroup$
    – tutizeri
    Commented Jan 3 at 22:16
  • $\begingroup$ @tutizeri, the tree may have height log(n), but that doesn't mean the storage cost is reduced by a factor of log(n). I think this answer is making incorrect statements. $\endgroup$
    – D.W.
    Commented Jan 3 at 23:42
  • $\begingroup$ @D.W. if you have $n$ consecutive numbers of $b$ bits each, they need $n \cdot b$ bits, and to store $n$ different numbers you need $b = \log_2 n$, so you need $n \log_2 n$ bits. A tree uses $\sum_{i=1}^{\log_2 n} i = (1 + \log_2 n)\log_2(n)/2$ bits, so it reduces $n$ to $\log_2 n$ $\endgroup$
    – tutizeri
    Commented Jan 4 at 0:00
