I was playing with a function that I think is collision free and uninvertible, assuming the hardness of integer factorization. I am unfortunately not as skilled at math as I would like to be, and do not know how to prove or disprove that the function is collision free. My only evidence so far comes from unit testing all possible inputs, which has succeeded for all 2-byte combinations, but that is obviously not conclusive.

The algorithm iterates over a sequence of bits while counting upwards through the prime numbers, beginning at 2. For each '1' bit in the current run, an exponent counter is incremented. When a '0' bit is encountered, the current prime raised to that counter is multiplied into the running output, the counter is reset to 0, and the current prime is advanced to the next prime. (When the input ends, the final prime power is multiplied in the same way.)

For example:

"11001101"

Is interpreted as:

(2 ** 2) * (3 ** 0) * (5 ** 2) * (7 ** 1)
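
In Python, the scheme looks roughly like this (a minimal sketch; the name bits_to_int is mine, and nextprime comes from sympy):

    from sympy import nextprime

    def bits_to_int(bits):
        # The exponent of the i-th prime is the length of the i-th run
        # of 1 bits; a 0 bit ends the current run and advances the prime.
        result, prime, exponent = 1, 2, 0
        for bit in bits:
            if bit == '1':
                exponent += 1
            else:
                result *= prime ** exponent
                exponent = 0
                prime = nextprime(prime)
        return result * prime ** exponent  # flush the final run of 1s

    assert bits_to_int("11001101") == (2**2) * (3**0) * (5**2) * (7**1)  # 700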

Is this function provably collision free, assuming the output is not truncated?

Is the function uninvertible for sufficiently large inputs/outputs?

Lastly, does the function become harder to invert if the output is modified (e.g., truncated or rotated)?

Edit in response to an answer: Let me rephrase the question to be more accurate about what I was actually doing. Suppose we iterate through the integers, starting at 0 and incrementing by one. Convert each integer to its binary form and process it in the manner described above. Do any output values produced this way ever collide?

  • try passing it a 512-bit input and see what happens – Commented Jan 11, 2016 at 6:33
  • It looks like the empty string collides with 0. – user991, Jan 11, 2016 at 8:38
  • Let us continue this discussion in chat. – mikeazo, Jan 11, 2016 at 21:40

2 Answers


No, it is not collision-free. All possible sequences of 0's produce the same output:

0    --> (2 ** 0) = 1
00   --> (2 ** 0) * (3 ** 0) = 1
000  --> (2 ** 0) * (3 ** 0) * (5 ** 0) = 1
0000 --> (2 ** 0) * (3 ** 0) * (5 ** 0) * (7 ** 0) = 1

In fact, it can be seen that $f(s) = f(s||0)$, for every bit-string $s$.

This could easily be fixed by initializing the exponent counter to 1 instead of 0.

Anyway, I'm afraid you have not discovered anything new. The fundamental theorem of arithmetic states that every integer greater than 1 can be represented uniquely as a product of prime powers:

$$n = p_1^{\alpha_1}p_2^{\alpha_2} \cdots p_k^{\alpha_k} = \prod_{i=1}^{k}p_i^{\alpha_i}$$

Your idea is basically a method for interpreting the bit string as the sequence of exponents $\alpha_1, \alpha_2, ..., \alpha_{k}$.

It is also very easy to invert (leaving aside the issue of the number of trailing 0's). Let us assume that $f(s) = n$. If $n$ is divisible by 2, then the first bit of $s$ is 1. Now you divide by 2 and if the result is divisible by 2, then the next bit of $s$ is also 1. Repeat until it is not divisible by 2, and you know the next bit is 0. Repeat this process with the divisibility of 3, 5, 7, etc.
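
In Python, this inversion is just trial division by successive primes (a sketch; nextprime is from sympy):

    from sympy import nextprime

    def int_to_bits(n):
        # Emit a 1 for each successful division by the current prime;
        # when it stops dividing, emit a 0 and move to the next prime.
        bits, prime = [], 2
        while n > 1:
            if n % prime == 0:
                bits.append('1')
                n //= prime
            else:
                bits.append('0')
                prime = nextprime(prime)
        return ''.join(bits)

    assert int_to_bits(700) == "11001101"  # the example from the question
    assert int_to_bits(1) == ""            # trailing 0s cannot be recovered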


With respect to the new question you added: yes, there are collisions.

For example: $f(1) = f(2) = f(4) = ... = f(2^k) = 2$

In fact, for any integer $n$, it can be seen that $f(n) = f(n\cdot 2^k)$ for every integer $k>0$, which means that every integer has infinitely many collisions.

  • Not accepting as answer: padding 0 with insignificant 0s arguably makes it a different number. Using 1 for the starting exponents makes lots of numbers inexpressible (you cannot express a number that does not "contain" 3 or 5 in its factorization). I did not ask about or claim the origins of the idea. I am aware that the inversion is the process of factoring, which is why I mention sufficiently large inputs/outputs in regards to the feasibility of inversion. Lastly, you explicitly leave aside my final question of how truncating or modifying the outputs affects the feasibility of inversion. – Ella Rose, Jan 11, 2016 at 17:19
  • I appreciate you updating your answer for me. As such, I hate to do this to you, but there is one more adjustment to my question that I should have made. Because computers actually operate on 8 bits at a time, I should have specified that the binary representation of the integers be a multiple of 8 bits. I apologize for my inaccuracies regarding what I am specifically asking about. Should I edit my question again and include that information? The reason I bring this up is that in my code, F(1), F(2), and F(4) in fact do not produce the same output values, but 19, 17, and 13, respectively. – Ella Rose, Jan 11, 2016 at 23:42
  • +1 for helping me to realize that modifying the initial exponent could be used as a key to the function – Ella Rose, Jan 11, 2016 at 23:45
  • @E.Rose to process mathematical integers as computer 8-bit sequences, which is indeed common, you must specify the order, most commonly big-endian or little-endian (google either) -- although others are possible, such as the PDP-11 "NUXI" of yore. If big-endian, try 1 256 65536 16777216. Then 2 512 131072 33554432. PS: very few computers an ordinary person can program today actually process 8 bits at a time, although many address 8 bits at a time and some languages make that visible. – Commented Jan 13, 2016 at 1:08
  • @dave_thompson_085 That is correct, the endianness matters too. Re: 8-bit: in my comment I should have said a multiple of 8 bits the first time as well as the second; I am aware that register sizes are larger on modern machines. Thank you for your example, that is more helpful than infinite variations of a zero edge case. – Ella Rose, Jan 13, 2016 at 3:24

I'm answering the questions I posed. While cygnusv's answer was correct for the idea as I presented it, it was not helpful to me, as I had already identified and solved that issue but failed to present the corrected algorithm in my question. I'm not saying cygnusv is wrong; the answers presented are very much correct, and I gave a +1. However, there were more questions that were not addressed.

The problem of zeros on the right can be fixed simply by appending a '1' bit (or the string length) to the input string. I was already doing this in the toy hash function that used this idea, but it was applied before the unpack_factors function call, so I neglected to mention it (this was completely my fault).
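
Concretely, the fix looks like this (a sketch, using the single-'1'-bit variant; bits_to_int re-implements the encoder described in the question):

    from sympy import nextprime

    def bits_to_int(bits):
        # Encoder from the question: run lengths of 1s become exponents.
        result, prime, exponent = 1, 2, 0
        for bit in bits:
            if bit == '1':
                exponent += 1
            else:
                result *= prime ** exponent
                exponent, prime = 0, nextprime(prime)
        return result * prime ** exponent

    def bits_to_int_fixed(bits):
        # A terminating '1' makes trailing 0 bits significant.
        return bits_to_int(bits + '1')

    assert bits_to_int("0") == bits_to_int("00") == 1   # the old collision
    assert bits_to_int_fixed("0") == 3                  # 2**0 * 3**1
    assert bits_to_int_fixed("00") == 5                 # 2**0 * 3**0 * 5**1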

With this fix in place, the function gains what I will only call collision resistance, that statement being based solely on empirical testing (supplying inputs and recording outputs) rather than on any theoretical argument.

The hardness of factoring

In regards to the "hardness of factoring", this really only applies to RSA-style composites, where at least one of the prime factors is very large (very far from 0 on the number line).

Factoring a number composed of large primes is slow because of the enormous number of candidate primes to test. If a number is composed only of small primes, it can be factored efficiently even by a naive factoring algorithm.

For instance, unpacking a bitstring of all 1s yields 2 ** N, where N is the length of the string. Even a basic factoring algorithm will solve this almost immediately, more or less regardless of how long the input string is.
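
A quick check of that claim:

    # A 512-bit string of all 1s packs to 2**512; naive trial division
    # by 2 factors it completely in just 512 steps.
    n, count = 2 ** 512, 0
    while n % 2 == 0:
        n //= 2
        count += 1
    assert n == 1 and count == 512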

However, unpacking a bitstring of many 0s with a single 1 bit at the end would take significantly longer, and if the run of 0s were long enough to push the prime in question far enough out on the number line, the result would be "safe" in the sense that factoring it would never finish. However, composing such a number in the first place requires generating that distant prime, which is just as time consuming, so this is not a situation we can rely on.

Output truncation and rotation

Now, as for truncating the output, this has an interesting effect. It not only removes the collision resistance, it makes it trivial to calculate an infinite number of preimages for a given output. If you keep only the last N bytes of the output, then any input whose full output ends in those same N bytes is a preimage.

In other words, given a truncated output, you can append any byte(s) you want to the left of it, factor the result, and recover a valid preimage. While you will not know exactly which input produced the output, you can find an infinite set of inputs in which it must reside.
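
A sketch of that attack, assuming the truncation keeps the last N bytes (so two outputs collide whenever they agree mod 256**N; factorint and nextprime are from sympy):

    from sympy import factorint, nextprime

    def decode(n):
        # Read n's factorization back as a bit string: the exponent of
        # the i-th prime becomes the length of the i-th run of 1 bits.
        factors = factorint(n)  # {prime: exponent}
        runs, p = [], 2
        while factors and p <= max(factors):
            runs.append('1' * factors.get(p, 0))
            p = nextprime(p)
        return '0'.join(runs)

    def preimages(truncated, n_bytes, how_many=3):
        # Every integer congruent to the truncated value mod 256**n_bytes
        # truncates to the same bytes, and every integer > 1 decodes to
        # *some* bit string, so each candidate yields another preimage.
        modulus, found, candidate = 256 ** n_bytes, [], truncated
        while len(found) < how_many:
            candidate += modulus  # "append bytes to the left"
            found.append(decode(candidate))
        return found

Of course, a particular candidate may itself happen to be slow to factor; in that case you simply skip it and move on to the next one.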

I would think that rotating the output by a data-dependent amount increases the complexity of recovery by at most a factor of the number of possible rotation amounts, since an attacker could simply try each rotation and attempt to factor the result, but I am not totally sure. I could not find an effect for rotation like the one truncation has, though, besides probably introducing collisions.
