I recovered a few TB of data, yielding over a million files. Much of what was recovered is trash, but a small percentage consists of very valuable files buried in the mess. Question: how can I cull the recovery results to a manageable signal-to-noise ratio so that evaluating individual files becomes feasible?
Background
I used Foremost, TestDisk, dd, and PhotoRec to recover the data. Foremost and the like carve recovered data by filetype, so what you end up with is millions of files sorted by type into subdirectories. For example, I open one directory and am faced with 250,000 JPEGs.
To complicate matters, these programs get some things wrong. For example, I set the tool to recognize CSS files by looking for the code snippets `#* {`, `.* {`, `#*{`, and `.*{`, but there will inevitably be false positives with so simple a filter.
The logical (methodical) approach is to work through this by filetype. For example, I have to assess each file identified as "css" to see whether it even is CSS; 99.9% are NOT.
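One way I could automate that first pass: compare each carved file's detected MIME type against its claimed extension. A minimal sketch, assuming the python-magic package (libmagic bindings) is available; the directory names are hypothetical:

```python
# Sketch: verify carved files actually match their claimed type,
# assuming python-magic (pip install python-magic) is installed.
import os
import shutil
import magic  # libmagic bindings -- an assumption, not part of my current setup

SRC = "recovered/css"             # hypothetical directory of carved "css" files
REJECT = "recovered/css_rejected" # hypothetical holding pen for false positives

os.makedirs(REJECT, exist_ok=True)
for name in os.listdir(SRC):
    path = os.path.join(SRC, name)
    if not os.path.isfile(path):
        continue
    mime = magic.from_file(path, mime=True)
    # libmagic reports CSS as "text/css" or at least some "text/*" type;
    # anything binary is certainly a false positive.
    if not mime.startswith("text/"):
        shutil.move(path, os.path.join(REJECT, name))
```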
I am trying to make the task more manageable by culling files that can be identified as deletable (i.e., valueless, corrupt, unusable/unrecoverable), ideally by automated means or at least in bulk.
File Characteristics
Here's an estimate of the numbers of files:
type   size    approx. % corrupt   approx. % I may end up needing
jpg    10G     ~25%                ~0.0025%
js     13G     ~0%                 ~0.025%
less   1G      ~0%                 ~0.001%
mov    21G     ~0%                 ~50%
mp3    13G     ~50%                ~2%
mp4    1.5G    ~50%                ~25%
pdf    11G     ~20%                ~0.125%
wma    2.7G    ~90%                ~0.01%
zip    2G

(The PDF files are picture albums; each one is a collection of dozens of important pics. Examining them manually is hard and time-consuming; see the sketch just below this table.)
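For the PDFs, a first automated pass could at least separate the files that don't even parse from those worth opening. A minimal sketch, assuming the pypdf library (pip install pypdf) and a hypothetical directory layout:

```python
# Sketch: triage carved PDFs. A PDF that can't be parsed at all is flagged;
# one that parses may still be visually damaged, but this removes the worst
# of the noise before any manual review.
from pathlib import Path
from pypdf import PdfReader

for path in Path("recovered/pdf").glob("*.pdf"):  # hypothetical directory
    try:
        reader = PdfReader(str(path))
        n_pages = len(reader.pages)   # forces parsing of the page tree
        if n_pages == 0:
            print(f"EMPTY   {path}")
    except Exception:
        print(f"CORRUPT {path}")      # candidate for deletion
```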
Here are some of the procedures I'm using now:
JPG
Using Windows, I can open the folder and view thumbnails; files that don't display a thumbnail won't load.1 I can shift-click those and delete them. Unfortunately, it's a 35G directory of 320,000 .jpg files, with no organization.
__________
1 It would be more accurate to say that there is a high correlation between intact files and those that will display a recognizable thumbnail.
Using Windows, I right-click about 1,000 files and click Open, wait 5 minutes, then perform an operation on the first one that combines all of the open files into a single file, which takes about 10 minutes, and then bulk-close the 1,000 files. I then open that very large combined file and scroll through it, searching for real images. I highlight ~100 at a time individually, extract them to a new file for archiving, and finally delete the large file.
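A script could do that first-pass cull unattended. A minimal sketch, assuming Pillow is installed (directory name hypothetical); verify() only catches structural corruption, so it approximates the thumbnail test rather than guaranteeing the image is visually intact:

```python
# Sketch: flag JPEGs that fail a basic integrity check, assuming Pillow
# (pip install Pillow). Image.verify() detects truncated/garbled files
# cheaply without decoding the full image.
from pathlib import Path
from PIL import Image

bad = []
for path in Path("recovered/jpg").glob("*.jpg"):  # hypothetical directory
    try:
        with Image.open(path) as img:
            img.verify()            # raises on structural corruption
    except Exception:
        bad.append(path)

print(f"{len(bad)} suspect files")
# Review the list before deleting, e.g.:
# for p in bad: p.unlink()
```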
mp3, mp4, wma
Using Windows, I can't use Winamp for this because a single corrupt file crashes Winamp every time. So I use VLC: I put the files in a long playlist and listen. The bad files are skipped immediately, but it still takes a very long time.
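An unattended alternative I'm considering: decode every file with ffmpeg and flag the ones that report errors. A minimal sketch, assuming the ffmpeg binary is on the PATH (directory and extension hypothetical; the same loop works for mp4 and wma):

```python
# Sketch: decode-check audio/video files by piping them through ffmpeg's
# null muxer, which surfaces decode errors without writing any output.
# Slow for long files, but fully unattended, unlike a playlist.
import subprocess
from pathlib import Path

for path in Path("recovered/mp3").glob("*.mp3"):  # hypothetical directory
    result = subprocess.run(
        ["ffmpeg", "-v", "error", "-i", str(path), "-f", "null", "-"],
        capture_output=True, text=True)
    if result.returncode != 0 or result.stderr.strip():
        print(f"CORRUPT {path}")
```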
zip
Using Windows, I think I have zip handled: I select all the zip files, right-click, and use WinRAR to extract each archive to a separate folder. But clicking into each directory afterwards to inspect the contents is a big job.
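Testing the archives before extraction would keep the clearly broken ones from ever producing a directory to click into. A minimal sketch using Python's standard zipfile module (directory name hypothetical):

```python
# Sketch: CRC-check every member of each zip archive up front, so only
# archives that pass get extracted and reviewed.
import zipfile
from pathlib import Path

for path in Path("recovered/zip").glob("*.zip"):  # hypothetical directory
    try:
        with zipfile.ZipFile(path) as zf:
            first_bad = zf.testzip()   # name of first corrupt member, or None
        if first_bad is not None:
            print(f"BAD MEMBER {first_bad} in {path}")
    except zipfile.BadZipFile:
        print(f"NOT A ZIP  {path}")
```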
js, css, less
Using Windows, these are fairly easy to review quickly. I open the parent directory in Explorer, turn on the preview pane, select the first item, and then work through the files with the arrow, Delete, and Enter keys. Still, there are 20,000 files.
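A bulk pre-filter could shrink those 20,000 files before I ever open Explorer. A minimal sketch for the CSS candidates, flagging files that aren't UTF-8 text or that never match a selector pattern; the regex is my assumption of a slightly stricter filter than the `#*{` / `.*{` carving signatures (directory name hypothetical):

```python
# Sketch: bulk-triage carved "css" candidates by content rather than by eye.
import re
from pathlib import Path

SELECTOR = re.compile(r'[#.][\w-]+\s*\{')   # e.g. "#header {" or ".nav{"

for path in Path("recovered/css").glob("*.css"):  # hypothetical directory
    try:
        text = path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        print(f"BINARY  {path}")   # not text at all -- false positive
        continue
    if not SELECTOR.search(text):
        print(f"NO CSS  {path}")   # text, but no plausible selector
```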
Objective
I would like to know what methods are more efficient than these for sorting/filtering the aftermath of my data recovery, so I can expedite this. Linux suggestions are VERY WELCOME. Even as a first step, it would help if I could identify the corrupt files across the entire archive and delete those first, along with the zero-byte files.
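For that first step, here is a sketch of the zero-byte cull (root directory hypothetical; I'd run it in dry-run mode first):

```python
# Sketch: walk the whole recovery tree and delete zero-byte files.
# Run with DRY_RUN = True first to see what would be removed.
import os

ROOT = "recovered"   # hypothetical root of the recovery output
DRY_RUN = True

for dirpath, _dirs, files in os.walk(ROOT):
    for name in files:
        path = os.path.join(dirpath, name)
        if os.path.getsize(path) == 0:
            print(f"empty: {path}")
            if not DRY_RUN:
                os.remove(path)
```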