Consider the following problem. You have a list containing $k$ total identifiers comprising $n$ unique identifiers, with $k \ge n$ (i.e. there can be duplicates). These identifiers are long strings, far longer than necessary to keep the $n$ items unique. The identifiers need to be used in a series of computations, so to reduce the amount of I/O you would like a preprocessing step that shortens them as much as possible while retaining uniqueness. The original identifiers don't intrinsically encode any useful information, so you can happily discard them; you just need a mapping from the unique IDs in the original dataset to a new set of unique IDs in the smaller dataset.
You're given an alphabet $a$ to use, e.g. the English alphabet with upper and lower case a...zA...Z, so you know the number of alphabet characters ($|a| = 52$ in the example). The question is: given $a$ and $n$, how do you calculate the maximum string length $m$ needed to represent all $n$ unique items, assuming you can use variable-length strings of size in the interval $[1,m]$? For example, $\{a, aa, bbb, aab, z, Z\}$ are all valid identifiers with $m \ge 3$ and the alphabet a...zA...Z.
A simpler problem
There is a simpler problem that can be solved fairly easily, which is where the output strings are fixed length, i.e. every output string is length $m$. In this case, one can calculate the simple minimum $m_s$ as:
$m_s = \left \lceil \log_{|a|} n \right \rceil$
E.g. for the binary alphabet, $\left \lceil \log_{2} 1000 \right \rceil = \left \lceil 9.97 \right \rceil = 10$ (to represent 1000 unique identifiers). But again this is for a fixed-length output string, where every string is 10 bits and you can't separately use strings of length $[1,9]$.
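As a sanity check, here is a minimal Python sketch of computing $m_s$ (Python and the function name `fixed_length` are choices assumed here, not from the question). It uses pure integer arithmetic, which avoids floating-point rounding problems near exact powers of the alphabet size:

```python
def fixed_length(n: int, alphabet_size: int) -> int:
    """Smallest m_s such that alphabet_size ** m_s >= n; for n > 1 this
    equals ceil(log_{alphabet_size}(n))."""
    m = 1
    # Integer arithmetic: no risk of a float log rounding the ceiling
    # the wrong way when n is an exact power of the alphabet size.
    while alphabet_size ** m < n:
        m += 1
    return m

print(fixed_length(1000, 2))   # 10
print(fixed_length(1000, 52))  # 2
```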
Number of unique identifiers with strings of variable length $[1,m]$
If you're given $m$ up front then one can figure out how many unique identifiers $u$ are representable. It's:
$u = \sum_{i = 1}^{m} |a|^i$.
E.g. with a 52-letter alphabet and a variable-length string of up to $m = 4$ characters you get $52^1 + 52^2 + 52^3 + 52^4$. That is, it's a sum of consecutive powers of the alphabet size (a geometric series), so $u$ simplifies to:
$u = \frac{|a|^{m + 1} - 1}{|a| - 1} - 1$
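The closed form can be checked against the direct sum; here is a quick Python sketch (the function names are illustrative, not from the question):

```python
def u_sum(a: int, m: int) -> int:
    # Direct sum: a^1 + a^2 + ... + a^m
    return sum(a ** i for i in range(1, m + 1))

def u_closed(a: int, m: int) -> int:
    # Geometric-series closed form: (a^(m+1) - 1) / (a - 1) - 1.
    # Integer division is exact because a - 1 divides a^(m+1) - 1.
    return (a ** (m + 1) - 1) // (a - 1) - 1

print(u_closed(52, 4))                  # 7454980
print(u_sum(52, 4) == u_closed(52, 4))  # True
```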
What I haven't been able to resolve is whether this helps with the problem of determining $m$ based on $n$.
A heuristic approach
The best I've been able to come up with so far is to upper bound $m$ by $m_s$ (i.e. $m \le m_s$): anything that fits in a fixed-length string of $m_s$ characters certainly fits in a variable-length string of length at most $m_s$. From there you can binary search for the smallest acceptable $m$ in $[1, m_s]$, computing $u$ at each step, which takes $O(\log m_s)$ evaluations of $u$.
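That heuristic can be sketched in Python as follows (assuming the geometric-series closed form for $u$; the function names are illustrative):

```python
def min_variable_length(n: int, a: int) -> int:
    """Smallest m such that variable-length strings of length 1..m over
    an alphabet of size a give at least n unique identifiers."""
    def u(m: int) -> int:
        # Number of non-empty strings of length <= m: sum_{i=1}^{m} a^i
        return (a ** (m + 1) - 1) // (a - 1) - 1

    # Upper bound: the fixed-length answer m_s, via integer arithmetic.
    hi = 1
    while a ** hi < n:
        hi += 1

    # Binary search the smallest m in [1, hi] with u(m) >= n.
    lo = 1
    while lo < hi:
        mid = (lo + hi) // 2
        if u(mid) >= n:
            hi = mid
        else:
            lo = mid + 1
    return lo

print(min_variable_length(1000, 52))  # 2  (52 + 52**2 = 2756 >= 1000)
print(min_variable_length(1000, 2))   # 9  (2 + 4 + ... + 2**9 = 1022 >= 1000)
```

Note that for the binary alphabet and $n = 1000$ the variable-length answer is $m = 9$, one shorter than the fixed-length $m_s = 10$, because the 1022 strings of length 1..9 already cover 1000 identifiers.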
What I'm interested to know is, is there a more efficient way to calculate $m$ given $n$ and $a$?