1
$\begingroup$

Consider the following problem. You have a list containing $k$ total identifiers comprising $n$ unique identifiers such that $k \ge n$ (i.e. there can be duplicates). These identifiers are long strings, far longer than necessary to retain uniqueness of the $n$ items. These identifiers need to be used in a series of computations. To reduce the amount of I/O, you would like to perform a preprocessing step that shortens the identifiers as much as possible while retaining uniqueness. The original identifiers don't intrinsically encode any useful information, so you can happily discard them, you just need to perform a mapping of unique IDs in the original dataset to a new set of unique IDs in the smaller dataset.

You're given an alphabet $a$to use, e.g. the English alphabet with upper and lower case a...zA..Z and so you know the number of alphabet characters ($|a|$ = 52 in the example). The question is, given $a$ and $n$, how do you calculate the maximum string length $m$ needed to represent all $n$ unique items, assuming you can use variable-length strings of size in the interval $[1,m]$? So for example, $\{a, aa, bbb, aab, z, Z\}$ would all be valid identifiers with $m \ge 3$ and the alphabet a...zA...Z.

A simpler problem

There is a simpler problem that can be solved fairly easily, which is where the output strings are fixed length, i.e. every output string is length $m$. In this case, one can calculate the simple minimum $m_s$ as:

$m_s = \left \lceil{log_{|a|}n} \right \rceil$

E.g. for the binary alphabet $\left \lceil{log_{2}1000} \right \rceil = \left \lceil{9.97} \right \rceil = 10$ (to represent 1000 unique identifiers). But again this is for a fixed-length output string, where every string is 10 bits and you can't separately use strings of length $[1,9]$.

Num. unique identifiers with string of variable length $[1,m]$

If you're given $m$ up front then one can figure out how many unique identifiers $u$ are representable. It's:

$u = \sum_{i = 1}^{m} |a|^i$.

E.g. with a 52 letter alphabet and a variable length string with $m$ = 4 characters you get $52^1 + 52^2 + 52^3 + 52^4$. i.e. it's the sum of consecutive powers of the alphabet size, so $u$ translates to:

$u = \frac{|a|^{m + 1} - 1}{|a| - 1} - 1$

(see here for formula).

What I haven't been able to resolve is whether this helps with the problem of determining $m$ based on $n$.

A heuristic approach

The best I've been able to come up with so far is to say that you can upper bound $m$ by calculating $m_s$ ($m \le m_s$) i.e. anything which fits in fixed string length $m_s$ also fits in variable string length $m$. From here you can use a "binary search" approach to find the smallest acceptable $m$ between 1 and $m_s$ by calculating $u$ at each step. This takes time $O(log\:m_s)$.

What I'm interested to know is, is there a more efficient way to calculate $m$ given $n$ and $a$?

$\endgroup$
3
  • $\begingroup$ The final subtracted $1$ in the formula just above the (see here for formula) link should be part of the numerator. Interesting question... $\endgroup$
    – coffeemath
    Commented Nov 16, 2014 at 7:50
  • $\begingroup$ @coffeemath actually in my case I'd missed out a 1 accidentally (fixed in edit). I have the additional "-1" at the end of my equation which isn't in the linked article's equation because they count powers starting at 0 not 1. Because I'm working with new identifiers that are at least 1 char long (empty not allowed) I exclude the empty case, hence the minus 1 on the end. $\endgroup$ Commented Nov 16, 2014 at 8:41
  • 1
    $\begingroup$ Actually the sum from $a^1$ up to and including $a^m$ is $(a^{m+1}-a)/(a-1).$ [Note it's $a$ subtracted on the top part, not $1$, and no subtracted $1$ at the end. Do the usual trick but on the sum starting with $a^1$...] I have an answer below using this formula to get the smallest max length $m$ to cover $n$ different items. $\endgroup$
    – coffeemath
    Commented Nov 16, 2014 at 9:53

1 Answer 1

1
$\begingroup$

If I have this right, you know the alphabet size $a$ and you seek to use strings of length $k$ where $1 \le k \le m$ to code some $n$ given objects. Then the number of strings is $a+a^2+ \cdots +a^m$ which is $$\frac{a^{m+1}-a}{a-1}.$$ Putting this $\ge n$ then gives $a^{m+1}-a \ge (a-1)n,$ or $a^{m+1} \ge (a-1)n+a.$ Taking logs then gives the inequality $$m+1 \ge \frac{\log((a-1)n+a)}{\log a},$$ which after subtracting $1$ from each side and applying a ceiling gives the minimal max length $m$ for the strings to represent $n$ distinct items.

[Note: I was thinking of the logs as base $e$ but any base gives the same result for the given expression.]

$\endgroup$

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .