
Can I take two strings and produce a third, "small" string that contains all their information?

This question has obvious variants and generalizations, so let me describe a simple example of what I'm thinking about.

Motivation

I would like to be able to combine two strings of the same length (allowing the lengths to vary would be even better, though I suspect that is a harder problem) while retaining the ability to recover the original strings efficiently. The solution does not have to be optimal; my main concern is storage, i.e. the length of the combined string.

Example

Consider the randomly generated strings $s_1 = 123abc$ and $s_2 = 789xyz$, both of length $6$. Can I create a function $f$ that maps $s_1$ and $s_2$ to a string $s$ of length $6$ while preserving the information of $s_1$ and $s_2$? That is, can I recover $s_1$ and $s_2$ with some function of $s$?

Just messing around with it, my intuition says no. If the answer were yes, then it appears we could store an infinite amount of information in a finite space. So if that thinking is correct, I'm curious whether we can contain the information of $s_1$ and $s_2$ in a string $s$ of length strictly less than $12$. I say $12$ because a concatenation like $s = s_1 s_2$ preserves all the information of $s_1$ and $s_2$.
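A minimal sketch in Python of the concatenation approach (the names `pair` and `unpair` are just illustrative), assuming both inputs have the same known length:

```python
def pair(s1: str, s2: str) -> str:
    """Combine two equal-length strings by concatenation (lossless)."""
    assert len(s1) == len(s2)
    return s1 + s2

def unpair(s: str) -> tuple[str, str]:
    """Recover the original strings by splitting at the midpoint."""
    mid = len(s) // 2
    return s[:mid], s[mid:]

s = pair("123abc", "789xyz")   # length 12: the sum of the input lengths
assert unpair(s) == ("123abc", "789xyz")
```

Note that decoding relies on knowing the split point, which is why the equal (or otherwise known) lengths matter.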


Tangentially, I'm unfamiliar with this topic, but I believe it is part of information theory, and I suspect what I'm describing has a specific name. So if someone could point me in that direction I would be appreciative. Furthermore, I'm wondering: could I store strings efficiently via some mechanism other than a single resulting string? Thank you!


3 Answers


Your intuition is correct -- at least if your strings are made from symbols in some predetermined finite alphabet. It corresponds to a simple counting argument:

Suppose there are $N$ different symbols that can appear in your strings, and that $N\ge 2$.

Then there are $N^6$ different strings of length $6$, and therefore there are $(N^6)^2=N^{12}$ different pairs of two strings that your function can take as input. But there are only $N^6$ different outputs it can give.

Since $N\ge 2$ we have $N^{12}>N^6$, and therefore the pigeonhole principle says that $f$ cannot be injective.

So there are two different pairs $(s_1,s_2)$ and $(s'_1,s'_2)$ that map to the same output. Since the pairs are different, either $s_1\ne s'_1$ or $s_2\ne s'_2$ (or perhaps both). This means that either you cannot recover the first input from the output, or you cannot recover the second.


(If $N=1$, then there is only one possible string of length $6$ and the function that always returns that string is a valid pairing function. You can invert the pairing because both the left and the right input must have been the only possible string. But that's kind of an anomaly, and the trick stops working anyway if you want a pairing function that works on input strings of different lengths that you have not fixed in advance).
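For a tiny alphabet the pigeonhole argument can be checked by brute force. A sketch in Python (the candidate $f$, character-wise maximum, is an arbitrary illustration; any $f$ with this domain and codomain must collide somewhere):

```python
from itertools import product

ALPHABET = "ab"   # N = 2 symbols
LENGTH = 3        # strings of length 3

strings = ["".join(p) for p in product(ALPHABET, repeat=LENGTH)]
pairs = list(product(strings, repeat=2))

# |domain| = N^(2*LENGTH) = 64 pairs, but only |codomain| = N^LENGTH = 8 outputs.
print(len(pairs), len(strings))  # 64 8

def f(s1, s2):
    """An arbitrary candidate pairing: character-wise maximum."""
    return "".join(max(a, b) for a, b in zip(s1, s2))

seen = {}
collision = None
for s1, s2 in pairs:
    out = f(s1, s2)
    if out in seen and seen[out] != (s1, s2):
        collision = (seen[out], (s1, s2), out)
        break
    seen[out] = (s1, s2)

# Two different pairs share an output, so f cannot be inverted.
print(collision is not None)  # True
```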


What you're looking at is essentially the fact that universal lossless compression is impossible, but I don't think that has a short slogan-like name.


I’m assuming your strings are made from a finite alphabet of symbols. If so, the answer is no. With $n$ different symbols, the number of ordered pairs of strings $s_1$ and $s_2$, each of length $m$, is $n^m \times n^m = n^{2m}$. The number of strings of length $k$ is $n^k$, so to represent every such pair as a distinct string of length $k$ we would need $n^k \ge n^{2m}$, i.e. $k \ge 2m$: the output string must have length at least $2m$, which concatenation achieves.
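Concatenation attains that $2m$ bound, and so does interleaving the characters. A sketch in Python (function names are illustrative):

```python
def interleave(s1: str, s2: str) -> str:
    """Encode two equal-length strings as one string of length exactly 2m."""
    return "".join(a + b for a, b in zip(s1, s2))

def deinterleave(s: str) -> tuple[str, str]:
    """Recover the originals from the even- and odd-indexed characters."""
    return s[0::2], s[1::2]

assert deinterleave(interleave("123abc", "789xyz")) == ("123abc", "789xyz")
```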


As other answers have explained, your intuition is correct: there is no way of storing two arbitrary strings of length $n$ in a single string of the same length, nor even in a string of length $n' < 2n$.

However, if you know that your original strings $s_1$, $s_2$ will always draw from a strict subset of the allowed characters, then you can devise some "encoding" that takes fewer than $2$ output characters for each pair of original characters.
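For example, if both inputs are known to use only the $16$ hex digits while each output position may hold any of $256$ byte values, then one output byte can carry one character from each input. A sketch in Python, with hypothetical helper names:

```python
HEX = "0123456789abcdef"  # assumed 16-symbol input alphabet

def pack(s1: str, s2: str) -> bytes:
    """Store one character of each input in the high/low half of a byte."""
    return bytes(HEX.index(a) * 16 + HEX.index(b) for a, b in zip(s1, s2))

def unpack(data: bytes) -> tuple[str, str]:
    """Split each byte back into its two 4-bit halves."""
    s1 = "".join(HEX[b // 16] for b in data)
    s2 = "".join(HEX[b % 16] for b in data)
    return s1, s2

# Two length-3 inputs fit in 3 output bytes instead of 6.
assert unpack(pack("1f3", "a07")) == ("1f3", "a07")
```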

And if you cannot guarantee the above, but you know that some characters are very frequent and others very rare, you may still be able to encode the two strings into a single string that, on average, takes fewer than $2n$ characters.

That is the essence of data compression, which is the main (but not the only) application of Shannon's source coding theorem (his "first theorem") -- which, yes, is one of the main themes of information theory.
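As a quick illustration (using Python's standard zlib, a general-purpose compressor rather than a code tuned to this source), two highly repetitive strings of total length $2n$ can be stored in far fewer than $2n$ bytes:

```python
import zlib

s1 = "aaaaaaaaaaaaaaaaaaaa"  # skewed symbol frequencies compress well
s2 = "aaaaaaaaaaaaaaaabbbb"
blob = zlib.compress((s1 + s2).encode())
assert len(blob) < len(s1) + len(s2)

# Recover both strings: decompress, then split at the midpoint
# (this relies on the two inputs having equal, known lengths).
text = zlib.decompress(blob).decode()
mid = len(text) // 2
assert (text[:mid], text[mid:]) == (s1, s2)
```

The compressed length depends on the input statistics: random strings over the full alphabet would not shrink at all, consistent with the counting arguments above.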

