
I am trying to recover the contents of a text file that became corrupt following a power outage. The file size is still correct (14 MB), but the text in the lower half of it has been replaced by spaces.

I found inspiration in:

My dd image is almost done. I imaged the whole partition, about 850 MB.

Now, I'm not sure how to search. Ideally (if it still exists...), I'd like to get back a big chunk of about 260k lines, the whole second half. If that's too big to extract, though, 10,000 lines around the middle of the file will have to suffice.

I must confess I'm not familiar with "offsets", hex/decimal, etc. Even the grep manual is not so easy for me to understand, although I will keep at it!

Anyway. What would you recommend? I'm not married to the dd image and grep; I'm welcoming any and all suggestions! Thank you.

Comment:

  • Please add to your question how you created your dd image (no comment here). – Cyrus, Jan 29, 2022 at 11:28

1 Answer


If the file in question was a plain text file (as Linux understands it, i.e. UTF-8) and the filesystem you copied with dd is neither encrypted nor compressed, use strings on the image.

For each file given, GNU strings prints the printable character sequences that are at least 4 characters long (or the number given with the options below) and are followed by an unprintable character.

(source: man 1 strings)

You want something like:

strings -aw -e S -n 512 ddimage >extracted

(or pv ddimage | strings -aw -e S -n 512 >extracted to see the progress).

Then extracted will be a file you can view with less, search with grep, etc. In my tests, -e S was crucial for detecting UTF-8 text with multi-byte characters.
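Once extracted exists, you can start searching it right away. A minimal sketch (the phrase is a placeholder for text you remember from the lost file):

# -a forces grep to treat the input as text even if a few stray binary
# bytes slipped through; -C 5 prints five lines of context around each hit.
grep -a -C 5 'some phrase you remember' extracted | less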

There are possible problems:

  • The data you seek may be fragmented: scattered around the image, not necessarily in sequence. There may be old versions, and there may be fragments of other files (garbage, including text-like fragments of binary files), all possibly interleaved. It will be a textual jigsaw puzzle. Consider using the -s (--output-separator) option (see the sketch after this list), but keep in mind that if unrelated fragments are strictly adjacent in the image, you won't get a separator between them; they will look like one bigger chunk.

  • If the filesystem you copied with dd was on an SSD and TRIM was performed after the mishap but before you copied, then there's a risk the data you want to recover is gone. This is a bad scenario.

    On the other hand, if the filesystem was on an SSD and TRIM was performed before the mishap, with no TRIM between the mishap and the copy, then the TRIM may have wiped out unrelated old data (old versions of files, etc.) but not the data you're after. In effect you will get less garbage. This is a good scenario.

    As you can see, an SSD may be a disadvantage or an advantage. For an HDD these scenarios do not apply. Virtual disks may support something similar to TRIM.

  • -n 512 tells the tool to print sequences at least 512 bytes long. The manual says "characters", but my tests with UTF-8 multi-byte characters show it's "bytes" for sure. The lower the number, the more garbage you will get. On the other hand, you should not exceed the block size used by the imaged filesystem, which is at least 512 (the lowest common sector size for block devices). You said nothing about the filesystem; its block size may be e.g. 4096 or 8192 (a way to check it is sketched after this list). The point is your file may be fragmented, and -n higher than the block size will miss a textual block if it happens to sit between non-textual data. If your file was tiny (smaller than the -n you used), you might miss it completely. Similarly, you may miss the tail of your desired file if that part happens not to be adjacent to other text.

    Still, -n 512 should allow you to find almost all existing remnants of the file (unwanted garbage and fragmentation may be bigger problems than missing part(s)). Unless…

  • At the beginning I wrote "the filesystem […] neither encrypted nor compressed". An encrypted or compressed filesystem would not store textual data in its plain form, so strings would be useless. I guess some other features of some filesystems may lower your chances or cause some extra garbage.

  • extracted may be relatively huge. I ran strings -aw -e S -n 4096 on my system drive, which is about 477 GiB; the output was over 10 GiB. Some filtering is advised: e.g. grep -av '[[:lower:]][[:upper:]]' is a reasonable filter (but note it will drop lines containing kHz, kB, macOS or MacGyver); in my case it reduced 10+ GiB to 6 GiB.

    I note your entire image is about 850 MB, not that huge. Your extracted won't be bigger. It may still be too big for "manual" inspection though, even after filtering. Eventually you will probably need to use a good text editor or pager (capable of handling large text files) to interactively search for strings you know were in the file you want to recover. This way you will hopefully locate relevant fragments.

    Consider copying extracted to /dev/shm (or use vmtouch -l) to speed up your work; both the filter and this step are sketched below.
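Regarding the separator mentioned in the first bullet: a sketch of the -s variant, assuming your GNU strings is recent enough to have --output-separator (very old binutils releases lack it); the marker text itself is arbitrary:

# Each printable run is followed by a visible marker line, so you can
# tell where one recovered chunk ends and the next begins:
strings -aw -e S -n 512 -s $'\n==== boundary ====\n' ddimage >extracted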
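To choose a safe -n, you can read the block size from the image itself. A sketch, assuming the image holds an ext2/3/4 filesystem (tune2fs accepts a regular file, not only a device):

# Prints a line like "Block size: 4096"; keep -n at or below this value.
tune2fs -l ddimage | grep 'Block size'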
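Finally, a sketch of the filtering and RAM-copy steps from the last bullet:

# Drop lines where a lowercase letter is immediately followed by an
# uppercase one (typical of binary garbage, though it also discards
# legitimate lines containing e.g. kHz or macOS):
grep -av '[[:lower:]][[:upper:]]' extracted >filtered

# Keep the working copy on RAM-backed tmpfs so repeated searches are fast:
cp filtered /dev/shm/filtered
less /dev/shm/filtered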
