
I have a lot of recovered files, many of which are invalid even though they look OK by name and extension. This is expected.

Now I need to filter out those which are probably OK. I see two options:

For example, PowerPoint files (*.pptx) are actually zip containers that start with PK in the first two bytes. So the command

head --bytes=2 filename

outputs PK for most of the good files whereas the bad files don't start with PK.

Question 1: How can I combine head with find to list out the files that match?

Another approach is the file command. It prints

Zip archive data, at least v2.0 to extract

for good PowerPoint files but simply

data

for bad files.

Question 2: How can I combine file with find to list out valid files?

There are other file types too, but I can extend the technique once I have the basic idea :)

Question 3: Are there more obvious ways to do this?

3 Answers

1

With find you could do something like

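find . -type f -exec file {} + | awk -F: '/Zip archive data/ {print $1}'

which prints the name of each file that file identifies as Zip archive data. Splitting on : rather than on whitespace keeps names that contain spaces intact, although a name that itself contains a colon is still cut short.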

1
  • Thanks! awk is also a good solution; however, I have found that the -printf parameter makes things quite easy.
    – marlar
    Commented Oct 11, 2011 at 8:35
0

I would recommend using grep, since it is meant for searching for text in a file.

I am not certain of the particulars you will need, but this should get you started: http://www.computerhope.com/unix/ugrep.htm
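One way to make grep behave as if it were anchored to the start of the file is to feed it only the first few bytes of each file. A minimal sketch, assuming head supports -c (GNU and BSD versions do) and that the signature is plain ASCII without a newline; list_by_magic is a hypothetical helper name:

```shell
# list_by_magic DIR MAGIC: print every regular file under DIR whose first
# bytes are exactly MAGIC. (Hypothetical helper name.)
list_by_magic() {
    dir=$1 magic=$2
    find "$dir" -type f -exec sh -c '
        magic=$1; shift
        for f do
            # head -c passes on only as many bytes as MAGIC is long;
            # grep -x -F then compares that fragment literally and in full.
            if head -c "${#magic}" "$f" | LC_ALL=C grep -qxF -e "$magic"; then
                printf "%s\n" "$f"
            fi
        done' sh "$magic" {} +
}
```

For the zip-based formats in the question this would be called as list_by_magic . PK.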

1
  • I know grep quite well, but I can't find an option to anchor the expression to the beginning of the file, only the beginning of line (the ^ operator).
    – marlar
    Commented Oct 10, 2011 at 19:02
0

I have discovered find's -printf parameter, which makes many things much easier than using -exec. So this command will identify the (probably) good files and move them into a subdirectory called good:

find . -type f -printf "file '%p' | grep 'Zip archive data' && mv '%p' good \n" | sh

For Word and Excel documents you can use the string "Author:" as valid doc files seem to have this attribute.
