
I recovered a few TB of data, yielding over a million files. Much of what was recovered is trash, but a small percentage are very valuable files that are buried in the mess. Question: How can I cull the recovery results to a manageable "signal to noise" ratio to facilitate evaluating individual files?

Background

I used Foremost, Testdisk, dd, and Photorec to recover the data. Foremost and the like carve recovered data by filetype, so what you end up with is a million-plus files sorted by type into subdirectories. For example, I open one directory and I'm faced with 250,000 JPEGs.

To complicate matters, these programs get some things wrong. For example, I set it to recognize CSS files by looking for the code snippets #* {, .* {, #*{, and .*{, but there WILL inevitably be some false positives for such a simple filter.

The logical (methodical) approach is to work through this by filetype. For example, I have to assess each file identified as "css" to see whether it even is CSS; 99.9% are NOT.
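
To give an idea of the kind of bulk check I'm after (a rough, unverified sketch assuming a Linux box or WSL with standard GNU tools; the heuristic and the directory name are just examples): a real CSS file should contain at least one "property: value;" declaration, not just a stray brace.

    # crude heuristic: keep only files with at least one "property: value;" line
    mkdir -p probably-css
    for f in *.css; do
        [ -f "$f" ] || continue
        if grep -qE '^[[:space:]]*[A-Za-z-]+[[:space:]]*:[[:space:]]*[^;{}]+;' "$f"; then
            mv "$f" probably-css/
        fi
    done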

I am trying to make the task more manageable by culling files that can be identified as deletable (i.e., valueless, corrupt, unusable/unrecoverable), ideally by automated means or at least in bulk.

File Characteristics

Here's an estimate of the numbers of files:

type    size    approx % corrupt    approx % I may end up needing
jpg      10G        ~25%                ~0.0025%
js       13G         ~0%                ~0.025%
less      1G         ~0%                ~0.001%
mov      21G         ~0%                ~50%
mp3      13G        ~50%                ~2%
mp4     1.5G        ~50%                ~25%
pdf      11G        ~20%                ~0.125%
    (The PDF files are picture albums; each one is a collection of dozens of
    important pics. Examining them is hard and time-consuming to do manually.)
wma     2.7G        ~90%                ~0.01%
zip       2G

Here are some of the procedures I'm using now

JPG

Using Windows, I can open the folder and view thumbnails; files that won't display a thumbnail generally aren't intact.1 I can shift-click them and delete. Unfortunately, it's a 35G directory of 320,000 .jpg files, with no organization.
__________
1 It would be more accurate to say that there is a high correlation between intact files and those that will display a recognizable thumbnail.
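
If I move this to Linux, I'm guessing something like the following could do the same check in bulk (an unverified sketch; it assumes the jpeginfo package is installed, and the bad-jpg directory name is just an example):

    # quarantine JPEGs that jpeginfo reports as damaged
    mkdir -p bad-jpg
    for f in *.jpg; do
        [ -f "$f" ] || continue
        if jpeginfo -c "$f" | grep -qE 'ERROR|WARNING'; then
            mv "$f" bad-jpg/
        fi
    done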

PDF

Using Windows, I select about 1,000 files, right-click, and choose Open, then wait about 5 minutes for them to load. I run an operation on the first one to combine all open files into a single file (about 10 more minutes), bulk-close the 1,000 files, open that very large combined file, and scroll through it looking for real images. I highlight ~100 at a time individually, extract them to a new file for archiving, and finally delete the large combined file.
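
If I end up doing this on Linux instead, my guess is that poppler-utils could replace most of the hand work (a rough, unverified sketch; it assumes pdfinfo and pdfimages are installed, and pdfimages needs a reasonably recent version for the -all flag):

    # skip PDFs that pdfinfo can't even parse, then dump the embedded
    # images from the rest for quick visual triage
    mkdir -p pdf-images
    for f in *.pdf; do
        [ -f "$f" ] || continue
        if pdfinfo "$f" >/dev/null 2>&1; then
            pdfimages -all "$f" "pdf-images/${f%.pdf}"
        fi
    done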

mp3, mp4, wma

Using Windows, Winamp is out for this one because a single corrupt file kills it every time. So I use VLC: put the files in a long playlist and listen. The bad files are skipped immediately, but it still takes a very long time.
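
A Linux idea I haven't tried yet: ffmpeg can decode a file to a null output and report any errors, which might replace the listening step entirely (a sketch assuming ffmpeg is installed; the bad-media directory is just an example):

    # decode each file to a null sink; any reported errors mark it as suspect
    mkdir -p bad-media
    for f in *.mp3 *.mp4 *.wma; do
        [ -f "$f" ] || continue
        errs=$(ffmpeg -nostdin -v error -i "$f" -f null - 2>&1)
        if [ -n "$errs" ]; then
            mv "$f" bad-media/
        fi
    done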

zip

Using Windows, I think I have zip down: I select all the zip files, right-click, and have WinRAR extract each one as a separate archive. But clicking into every resulting directory afterwards is a big job.
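
On Linux, unzip can apparently test an archive without extracting it, which might let me throw out broken archives before the extract-everything step (an unverified sketch; the bad-zip directory is just an example):

    # quarantine archives that fail unzip's integrity test
    mkdir -p bad-zip
    for z in *.zip; do
        [ -f "$z" ] || continue
        unzip -tqq "$z" >/dev/null 2>&1 || mv "$z" bad-zip/
    done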

js, css, less

Using Windows, these are fairly easy to view quickly. I open the parent directory in Explorer, turn on file preview, select the first item, then use the arrow, delete, and enter keys on the keyboard. Still, there are 20,000 files.

Objective

I would like to know what methods are more efficient than these for sorting/filtering the aftermath of my data recovery. Linux suggestions are VERY welcome. Even as a first step, it would help if I could identify corrupt files across the entire archive and delete those first, and then delete the zero-byte files.

  • Be careful. Just because a preview program cannot open a file doesn't mean it's wholly corrupt. It might be just a small portion, or it might be something as simple as the wrong extension (file type) being applied to the recovered file. For data you don't care about that much it shouldn't be an issue, but for files you suspect are very important I'd double-check them using other software (e.g. the proper programs for opening them or conversion utilities).
    – Gene
    Commented Aug 15, 2015 at 22:17
  • Brian, I fixed the closure issue with the software rec. You might attract more readers if you shorten the question by trimming the commentary to focus on the technical task. You're not the first person to face this kind of problem, so I would think your question would have broad appeal. People will answer if they have something to offer. Recognize, though, there may not be a lot of good solutions.
    – fixer1234
    Commented Aug 18, 2015 at 16:10
  • ok, that edit may have helped. Commented Aug 18, 2015 at 20:28
  • @BrianThomas, No, requests for software recommendations are off-topic here. However they're exactly what the Stack Exchange Software Recommendations site is looking for. Commented Aug 20, 2015 at 1:58
  • @i-say-reinstate-monica Are you saying that my post is asking for software recommendations? I'm a bit confused by fixer1234's opening sentence there, and by your entire comment. I really don't understand what you're talking about. Commented Mar 3, 2021 at 20:04

3 Answers


I won't rehash what Scott has covered, and his discussion about potentially recovering corrupted files (or portions of them) is an area to explore. One point I'll add: some document formats will look like mostly garbage if you examine the raw file. However, the text content is often in large, recognizable chunks. Even if the file has been corrupted or parts are missing, you may be able to manually extract much of the text. But as Scott notes in a comment, this would be for salvaging specific content you've identified as valuable; it wouldn't be part of an automated process to deal with the files in bulk.

Strategy

A task of this scope will take forever, and you are likely to run out of steam before you finish. You want to get the most value that you can for your efforts. Let me suggest an approach. A caveat, though. I'm not aware of any off-the-shelf, automated solution. You may be able to help here or there with things like scripts, but this will be largely a manual process. The key is to make that process as efficient as you can.

  • Prioritize. Invest your time where you are likely to get the most benefit. This means adhering to a good process rather than doing things with random files. It also means deciding to not do anything with low-potential files, at least until you have finished with the high-potential files.

  • Make your time count. There's an old concept of time management: if you handle something, do something with it. Decide whether you can quickly finalize it. If so, do it. If not, prioritize it for later completion or discard it. But don't spend time on a file and then just throw it back in the pot.

  • Create organization. Part of the process will be looking at files and putting them into specific directories for later processing. Don't be afraid to create a lot of directories. Use them as a way to retain information you've learned about the files, to group them by potential, etc. Look for ways to quickly identify actionable information about the files and move them into directories to aggregate them for later processing. This will create pools of files to work on and leave behind files that will take more work.

  • Use the organization. Handle similar files in batches rather than processing a mix of different kinds of files with different needs. The repetition and similarity will make you more efficient at it.

Suggested Process (triage)

Think in terms of three categories: easy, high-potential files; discardable files; and files that will take more work. Process them in that order. When you get to the third category, repeat the process.

  1. Identify all of the recognizable, uncorrupted files and pull those off into a working pool. Aggregate them by filetype. These are the first ones to work on. I would sort them by size (see discussion farther down). Start with the largest files and work your way down. Go through these using the normal applications software to see what they are and what you want to keep. Or, just save them all and move on to the next step.

    For images, use a tool like Irfanview. It has features like thumbnail views and batch processing that can really speed up dealing with image files in quantity. I would move a few hundred files at a time into a directory to work on them.

  2. Of the remaining files, cull the smallest ones, as described below, to make the collection more manageable.

  3. That leaves files that will take more work. Look for characteristics that will allow you to identify another collection of files with potential. A characteristic I would start with is size. The largest files have the most potential to contain useful content, are less likely to be file recovery flotsam, and are less likely to have been created through corruption. The quantity of these is also likely to be manageable. Start with them and work down the pool of remaining files by size.
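
On Linux, a quick way to get that largest-first view is something like this (a sketch, assuming GNU find):

    # list files largest-first so the highest-potential ones surface
    find . -type f -printf '%s\t%p\n' | sort -rn | head -n 100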

Using File Size - Small Files

I have been in similar situations (but thankfully never the quantities you're dealing with). One way I found to make fast headway is file size. If the recovered files contain a typical mix of stuff, the size distribution will be skewed and there will be a large tail of small files, much of which will be discardable.

  • Many filetypes contain a lot of "overhead", like header information. Things like Word documents, and sometimes PDFs, can also contain things like embedded fonts. So even one byte of content requires a file of a certain minimum size. You can determine that minimum size by creating a one-byte file of each type.

  • For image files, look at very tiny ones and see what they contain. Use your file manager to sort by size, and then look at sample files as you work your way up in size. You will see that "good" images of minimal size contain stuff that you probably don't need to keep, like tiny snippets of artwork from web sites. Looking at samples will give you a good idea of the minimum size image that would be of interest.

  • For documents, consider the value vs. time to recover. You may have valuable snippets of text, like ideas or references you're saving. If that's the case, this will be less helpful. Otherwise, you are likely to find things like portions of saved drafts or very short segments of text. They may no longer be needed, or might be less work to recreate if you ever do need it than to examine tons of them and clean them up just in case. So you may be able to define a smallest size of interest.

Once you've done this exercise for a filetype, sort the directory by size and select everything smaller than your minimum. You can move/archive them to a backup before deletion. You might want to browse these for content (low priority) before deletion, or have another shot at review and recovery if your file selection accidentally gets messed up.
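
On Linux, once you've settled on a minimum size for a filetype, something like the following will sweep the small files into a holding directory rather than deleting them outright (a sketch only; the 2k threshold and the too-small directory name are just examples, and the -t option requires GNU mv):

    # park everything under the chosen threshold for low-priority review
    mkdir -p ../too-small
    find . -maxdepth 1 -type f -size -2k -exec mv -t ../too-small {} +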

I often found, particularly in Windows, that the file manager got indigestion when the number of files passed a certain threshold. You may find it faster and more reliable to work on a reasonable number of files (no more than a few hundred) at a time.

Ideas for identifying filetypes

When you get to the point that you're working with unrecognized filetypes, you're largely into manual territory. They're way down on the diminishing-returns curve. However, here are some ideas for identifying the filetypes of unrecognized files:

  • Many filetypes have header information. Open the file in a text editor and look at the first "paragraph" of content (a bulk version of this check is sketched after this list).

  • If you have a collection of gigantic files of unrecognizable type, the size alone may be a clue. Gigantic files are likely to be backups, archives, photo albums, or videos.

  • Once you've aggregated recognizable files by filetype, look at the size range. That will be a clue that may be useful in aggregating unrecognized files.
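
Here is a bulk version of the header check mentioned above (a sketch using the standard file utility; the MIME-type directory names are just a convention made up for illustration):

    # sort unrecognized files into directories named after their detected MIME type
    for f in *; do
        [ -f "$f" ] || continue
        mime=$(file -b --mime-type "$f")   # e.g. image/jpeg, application/zip
        dir=${mime//\//_}                  # image/jpeg -> image_jpeg
        mkdir -p "$dir" && mv "$f" "$dir"/
    done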

  • (1) Your remark “text content is often in large, recognizable chunks” suggests that the Unix file and strings commands might be of some value in this process, but I’m not sure (a) how much they can add to what the OP has already done, (b) how well they can scale to such a huge operation (i.e., how much they can contribute to an automation solution), or (c) how useful strings can be if recovering text isn’t a primary objective.  … (Cont’d) Commented Aug 21, 2015 at 4:44
  • (Cont’d) …  (2) The idea of backing up all the files is probably a good one.  I suspect that, if I were in this situation, and I made the backup, I would never use it … but if I deleted the files without making a backup, I’d wish that I had.  Murphy’s law.  YMMV. Commented Aug 21, 2015 at 4:45
  1. Find zero-length files?  That’s easy.  In Windows, type size:0 into the Windows Explorer search box (and, when you find them, you can delete them).  In Linux, you can do

    find . -type f -size 0 -exec rm {} +
    

    or, if you have GNU find, you can do

    find . -type f -empty -delete
    
  2. I'm not sure I understand everything you're saying, especially with regard to zip, but simple, repetitive actions like checking and/or deleting directories are typically easy to script in either Windows or *nix (a small *nix sketch of what I mean follows this list).

  3. I don't know anything, really, about the tools you used (Foremost, Testdisk, and Photorec), but you say that the majority of the files you got are trash. At the risk of increasing your workload substantially, I will repeat Gene's comment: you might want to consider the possibility that some of the files you got contain valuable information and are corrupted just badly enough that standard tools can't process them, yet have enough good structure that some other data-repair tool might be able to fix them.

    • For instance, I once had some image files that were simply truncated (cut off).  Of course there was no way to recover the pixels that weren’t there, but I discovered that the standard image viewer software I was using was stopping a few thousand bytes short of the end of the file.  I was able to fix the files so that all the pixels that were present, were displayed, resulting in about ten more scan lines becoming visible.  That was a long time ago in a galaxy far, far away, so I can’t provide any more detail than that.
    • Perhaps a better example is that a standard file viewing/editing tool is likely to fail if a file is missing its first 512 bytes, or if they are present but corrupted.  It may be possible to recover from such damage and reconstruct the missing data.

    Just something to think about.

  4. You might want to look at Automating the scanning of graphics files for corruption.  Be sure to check the links under the Linked and Related headings on the right.
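
Regarding point 2, here is a small *nix sketch of the kind of repetitive directory check I mean (it assumes the archives have already been extracted into per-archive directories, and uses GNU find):

    # after bulk extraction: drop empty directories, then list each
    # remaining directory with its file count for quick triage
    find . -type d -empty -delete
    for d in */; do
        printf '%8d  %s\n' "$(find "$d" -type f | wc -l)" "$d"
    done | sort -n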
  • This crap is very helpful!!!! NICE. And now I can better understand what I'm asking at the same time. I'll revisit this as the days come, and answer up here, to keep the momentum. In the meantime, ALL OTHER RESPONSES ARE WELCOME PLEASE. We need more answers! This is broad, but I'll refine my question after I understand better here. Thanks much! Commented Aug 20, 2015 at 6:23

It should be noted that, when recovering files with the “raw file carving” method (also called “file signature search”), the number of corrupted files can vary depending on which file types were activated during the scan. For instance, Photorec will often truncate (and thus corrupt) perfectly valid, contiguous video files which happen to contain a random fake JPG signature within their stream; if JPG detection is disabled, the same file on the same source drive is extracted flawlessly. The detection algorithm seems rather crude in that regard. I reported that issue on the dedicated forum a few months ago, and also made a report on the particular topic of recovering video files on a video-related forum. At least Photorec is free, and from what I could read, expensive professional data carving utilities don't fare significantly better.

So, to get the best possible recovery for any given file type, it may be wise to disable all other file types, at least those which tend to yield a lot of false positives (in my experience: JPG, MP3, MPG, ZIP, RAR, OGG; on the other hand, WMV or MKV files have a very distinct and quite long signature, so the likelihood of false positives is very low, although even a file with a valid header can be truncated). For the same reason, PDF files can also be recovered with “holes” corresponding to image files stored uncompressed within their stream.

Generally speaking, raw file carving should be considered a last-resort method, for when recovery by means of filesystem analysis is not possible (because the filesystem was badly damaged) or doesn't recover particular files of interest (because their metadata has been overwritten). If a file was fragmented on the source partition, it is very difficult to reassemble its scattered chunks, which can be hundreds of gigabytes apart. More advanced utilities are advertised as being able to reconstruct fragmented files with a much better success rate, but I have no experience with them.

As for assessing which files among a mountain of recovered files are valid and which are not, one relatively simple method is to use a duplicate detector, set to analyse only the first few bytes, with a bunch of known valid files as reference. For instance, I use DoubleKiller (my favorite duplicate detector, even though it hasn't been updated in a long time: it's streamlined yet efficient and precisely customizable, with one big caveat being that it doesn't recognize Unicode characters). To check JPG files, I would put a bunch of known good JPG files of various origins inside a new folder (or select an existing folder which already has a variety of JPG files in it; 10 to 20 should be enough, don't put your entire porn collection), drag-and-drop that folder to the “Recent” panel, then drag-and-drop the folder with recovered JPG files to the “Library” panel (files in “Library” are reported only if they match at least one file in “Recent”). Then set the analysis to “first 10 bytes”, with no size / name / date restriction; that's enough to check the signature (which can be something like “ÿØÿà JFIF” or “ÿØÿáyœExif” and other variations) without comparing the actual image's binary contents, which are always different for different files. Files that are listed as duplicates should be valid JPG files (at least they should have a valid JPG header, but they may still be corrupted or unreadable if a large chunk is missing elsewhere).
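
A rough Linux equivalent of this first-bytes check, without a duplicate detector, could look like the following (a sketch; the three-byte FF D8 FF signature covers the JFIF and Exif variants mentioned above, and the destination directory is just an example):

    # keep files whose first three bytes match the JPEG signature FF D8 FF
    mkdir -p jpg-with-valid-header
    for f in *.jpg; do
        [ -f "$f" ] || continue
        if [ "$(head -c 3 "$f" | xxd -p)" = "ffd8ff" ]; then
            mv "$f" jpg-with-valid-header/
        fi
    done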

