Automatically 'brute force' a few bytes to recover a corrupt file

Question

Does anyone out there know of a way to brute force values at a particular offset in a file? It's 4 consecutive bytes which would need to be brute forced. I know the correct SHA-1 of the corrupt file. So, what I'd like to do is compare the complete file SHA-1, each time it changes the byte value.

I know the exact 4 bytes which were changed, because the file was given to me by a data recovery expert, as a recovery challenge. For those who are interested in knowing, the rar file has 4 bytes which were intentionally changed. I was told the offsets of the changed 4 bytes and the original SHA-1. The person said it's IMPOSSIBLE to recover the exact file in the archive once the 4 bytes were changed. Even if it was only a few bytes and you knew exactly where the corruption was located. Since it doesn't have a recovery record. I'm trying to see if there is a way for those particular 4 bytes to be filled in correctly so the file will decompress without error. The file size is around 5mb.

Example:

I uploaded photos so it's more clearly defined of exactly what I'm looking to do. I believe someone can post them here for me with more rep.

The example offset I'm focusing on is 0x78 where the first pic shows the value as CA I want the script to the take the value up by 1 so it becomes CB as shown in the second pic. I want it to keep increasing the value by 1 and then compare the whole file SHA-1 each time. Only making changes to those 4 bytes at the specified offset.

It will try CAC5C58A and compare the SHA-1. If doesn't match, then it will try CBC5C58A.Then once the first value reaches FF it will then go to 00C6C58A and so on. Basically, I would like it to be able to go from 00000000-FFFFFFFF but to also have the option to choose where you want it start and end. I know it could take some time but I would still like to try it. Keep in mind I know the exact offset of the bytes which are corrupt. I just need the correct values.

If you search on Google: "How to fix a corrupted file by brute force" There's a person that wrote a Linux program. However, it only works against the files included with the program. I'm looking for some way to use the same process with my file.

Welcome to Super User! I have edited your question to remove the request for a program, which would be off-topic. Can you edit your question to include (some of) the examples you saw? It's good that you have done research, but showing us exactly what research that is would be helpful :) — bertieb, Commented Apr 19, 2018 at 10:21
could i ask how you ended up with this file and how you can be sure that those are the only 4 corrupt bytes? — Edoardo, Commented Apr 19, 2018 at 12:38
Do you know the file format ? If you do you might be able to work out the correct values or limit the ranges, rather than trying to brute force them. In general, however, I'd suggest any corrupted file should be dumped for security reasons. — StephenG - Help Ukraine, Commented Apr 19, 2018 at 14:59
@eddyce I'm really interested in the second part of your question - why those 4 bytes? — Craig Otis, Commented Apr 19, 2018 at 15:26
Out of curiosity, how did the file get corrupted? And how do you know it was those four bytes? — JohnEye, Commented Apr 19, 2018 at 16:49

tripleee · Accepted Answer · 2018-04-20 08:44:43Z

Here's a small Python program which does what you seem to be describing.

#!/usr/bin/env python3
from hashlib import sha1

with open('binaryfile', 'rb') as bin:
    binary = bin.read()

base = 0x0078
# ... is not valid Python; add more sequences, or take it out (or see below)
for seq in [[0xCA, 0xC5, 0xC5, 0x8A], [0xCB, 0xC5, 0xC5, 0x8A], ...]:
    copy = binary[0:base]
    copy += bytes(seq)
    copy += binary[base+len(seq):]
    if sha1(copy).hexdigest() == '9968733ce3ff0893bbb0a19e75faaf2fb0000e19':
        print('success with bytes {0}'.format(seq))
        break
else:
    print('no success')

UnOnly briefly tested; please ping me if you find typos.

The base specifies where to try to apply the four bytes, and the long string '996873... is the hex representation of the expected SHA1. The line for seq in... defines the bytes to try; and of course replace 'binaryfile' with the path to the file you want to attempt to salvage.

You can replace the literal list [[0xCA, 0xC5,...]] with something to actually loop over all possible values but it's basically just a placeholder for something more useful because I'm not really sure what exactly you want there.

Something like for seq in itertools.product(range(256), repeat=4)): will loop over all possible values from 0 to 2³²-1. (You will need to add import itertools near the top then.) Or perhaps you could simply add an offset; update the script to replace the current for seq in with the following (where again the import needs to go before the main program);

import struct

for n in range(2**32):
    val=(n+0x8AC5C5CA) % 2**32  # notice reverse order
    seq=list(reversed(struct.pack(">I", val)))
    copy = ...

I reversed the order of the bytes so that it naturally increments from 0x8AC5C5CA to 0x8AC5C5CB but then the next increment will be 0x8AC5C5CC etc. The struct magic is to convert this to a sequence of bytes (had to look it up from https://stackoverflow.com/a/26920983/874188). This will start at 0x8AC5C5CA and go to 0xFFFFFFFF, then wrap around to 0x00000000 and climb back up to 0x8AC5C5C9.

If you have multiple candidate ranges you would like to examine in a particular order, maybe something like

for rge in [(0x8AC5C5CA, 0x8AFFFFFF), (0x00C6C58A, 0x00FFFFFF),
        (0x00000000, 0x00C6C589), (0x01000000, 0x8AC5C5C9)]:
    for val in range(*rge):
        seq=list(reversed(struct.pack(">I", val)))
        copy = ...

but then you'll need to make sure yourself that the (start, end) pairs in rge cover all of the space between 0x00000000 and 0xFFFFFFFF if you really want to examine all of it. (And again, notice that the range increments the last byte and that seq applies the bytes of the value in reverse, in accordance with your stated requirements.)

If you wanted to use two different base addresses, you quickly run up against the limits of what's feasible to do in your lifetime with brute force; but you could, for example, split the 4-byte number into two 2-byte parts and apply those at different offsets.

base1 = 0x1234
base2 = 0x2345

for seq in range(whatever):
    copy = binary[0:base1]
    copy += bytes(seq[0:1])
    copy += binary[base1+2:base1+base2]
    copy += bytes(seq[2:3])
    copy += binary[base2+2:]

Comments are not for extended discussion; this conversation has been moved to chat. — Journeyman Geek, Commented Apr 20, 2018 at 12:25

Community · Accepted Answer · 2020-06-12 13:48:39Z

No, no, no and again NO!

Seldom the answer you get is not what you expect.

Some questions for you:

Is it possible that an expert doesn't know that it is possible to brute force a string of for bytes and iteratively try the SHA-1 until it converges? No
Is it possible he forget it ? No
Is it possible that you cannot do it on a rar file? No
Is the other answer wrong? absolutely NO

So what? ... Time.

The point is that you have to change so few bytes... only 4!

What does it means? 256⁴ that is 256x256x256x256 possibilities, a really really big number.
If your computer were able to process 1 operation per second (substitution in the file + sha1)...
you should wait more then 136 years, or if you prefer more then 49710 days.

You're enough lucky, a 5MB pre-cached file (already loaded in ram and in the cache) asks only about 0.03 seconds (min 0.025s), on an old computer. That shrinks your expecting time down to 1242-1492 days (something more then 3 years).

It is true, BTW, that statistically you should have a positive answer in half of the time. Nonetheless you should wait till you will have tried all the possibilities to be sure that there is only 1 substitution that will give you the same SHA-1 checksum...

Now that IMPOSSIBLE sounds as "not possible in a WORTHWHILE amount of time".

How to proceed

A more proper answer to your technical question: when you speak about brute force it have not to be necessary blind brute force.

It is just stated in a comment in the other answer that you do not need to calculate the sha1 checksum on the part before the corruption. You do the 1st time and you save time for each successive iteration (maybe a factor 2 it depends from the position).
Something that can change the worthless of the effort is to write a parallel code that will run on the GPU. If you have a good graphic card you may have around 1000 cores that can compute for you in parallel (even more but they have a frequency lower than the cpu, but still they are a lot). If you are able to decrease the time from 1400 to 1.4 days maybe you can even do it.
A different approach can lead you to a faster solution.
You said it is a rar file. The rar file structure is divided into blocks. If you take count of it you can see where the corruption falls. If it is on the part of the data, on the part of the headers or on both. Then you can act consequently. For sake of simplicity let's suppose it is over the data:
you can do the brute force of your offset, check for each positive CRC of that block if it is even positive the SHA1 on the whole file. Again you can do a parallel code.

Final note

If they were 6 bytes instead of 4 you were out of the game with the present technology.

Great answer - one wouldn’t necessarily need to exhaust the entire space though because the rar itself in this example would not uncompress due to internal checks even if the sha1 worked with a duplicate hash. Hitting 4 bytes that solved the sha1 falsely AND an internal crc falsely would be very very unlikely. — rrauenza, Commented Apr 20, 2018 at 17:00
@rrauenza Thanks. BTW not only (the double check). Indeed the block should be shorter then the whole part from the corrupted bytes to the end of the file, and the CRC should be lighter to compute then the sha1 algorithm... — Hastur, Commented Apr 20, 2018 at 17:45
@rrauenza Do you know how I would go about getting the actual parallel code to run on the GPU? I have a good GPU. Thanks. — Sbt19, Commented Apr 21, 2018 at 0:38
No I do not. You could use multiple cpus by partitioning the search space though. — rrauenza, Commented Apr 21, 2018 at 0:43
@Sbt19 Whatever they said you about it google is not so scaring to use ;-). Search for (if nvidia) Cuda, brute force, sha1 and you will have a lot of hints, e.g. source code. BTW keep your attention high because browsing from that google path, oh my boy, may lead you on one of the dark sides of the net... :-). (Not on github...in other site that you can meet with this kind of researches). PS> There is a lot of scientific papers on related topics, e.g. this one... — Hastur, Commented Apr 22, 2018 at 6:47

Stack Exchange Network

Automatically 'brute force' a few bytes to recover a corrupt file

2 Answers 2

No, no, no and again NO!

So what? ... Time.

How to proceed

Final note

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged
data-recovery
repair
brute-force
.

Hot Network Questions

Automatically 'brute force' a few bytes to recover a corrupt file

2 Answers 2

No, no, no and again NO!

So what? ... Time.

How to proceed

Final note

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged data-recoveryrepairbrute-force.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
data-recovery
repair
brute-force
.