4

I usually use WinMerge to view the differences between files, but in this case it doesn't help. The files I'm comparing are known to have different filenames, which creates false positives: two files containing the same document are reported as different because their names differ.

I have a folder full of many directories representing all the vendors my company does business with, and they include many .pdf files of receipts & invoices. It's the master vendor list. The invoices & receipts are named such that the names don't make sense without the surrounding directory structure to provide context. For example here we have "Vendors/Company Foo/Product Bar/Invoice#3.pdf"

Then I have another folder with many receipts & invoices in it, that used to be maintained separately from the master vendor list, and was supposed to include a manually-created copy of every receipt & invoice that was entered into the appropriate entry in master vendor directory structure. These receipts & invoices were to have been renamed so they're easier for the accountant to read & know what they refer to. For example here we have "Taxes/CompanyFoo ProductBar.pdf".

I've searched for files of type .pdf in the top-level folder of the master vendor list, so that my search results include receipts & invoices from all the vendors in the directory structure. Then I copied these .pdf files to another folder on my Desktop, so I can compare them. I compared those files to the files in the 'taxes' folder using WinMerge to see if any of the files in the 'taxes' folder don't exist in the 'master vendor' directories, and vice-versa.

But WinMerge counts files as different just because their filenames don't match. I need to know whether the file content differs, regardless of the filename.

There are hundreds of these files & if any are in the 'taxes' folder that aren't in their corresponding 'master vendor' directory, I need to rectify that & file them correctly.

Can someone recommend a tool that can do this?

3
  • 1
    Why don't you use md5sum recursively? Two PDF files with the same checksum and the same file size have an extremely low chance of being different.
    – Benoit
    Commented Mar 18, 2012 at 19:35
  • possible duplicate of Which duplicate files and folders finders exist for Windows?
    – Daniel Beck
    Commented Mar 18, 2012 at 19:37
  • I found something in that thread that does what I need; in fact, the answer to that thread was exactly it. Thanks, Daniel Beck! I don't know how to make that the answer to this one, however. Commented Mar 18, 2012 at 23:43

5 Answers

2

If you have some kind of Unix environment available (if you're on Windows, I suggest Cygwin), you can easily find duplicate files below the current directory with something like this:

find . -type f -exec md5sum '{}' '+' | sort | uniq -D -w 32

The output will be the MD5 checksum and name of every file that has at least one duplicate (same checksum). Because the list is sorted by checksum, duplicates show up right after each other. Replace the . after find with the path you want to search if it's not the current directory.

Edit:

Conversely, to get the files that have no duplicates, you can use

find . -type f -exec md5sum '{}' '+' | sort | uniq -u -w 32

That will only print files without any duplicate below the current directory.
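For the two-folder case in the question, the same checksum idea can be pointed in one direction at a time. What follows is only a rough sketch, not a tested tool: the function names hash_tree and only_in are made up for illustration, and it assumes GNU md5sum and awk (as in Cygwin) and file names without newlines or backslashes.

```shell
# Print one "<md5>  <path>" line for every regular file below directory $1.
hash_tree() {
    find "$1" -type f -exec md5sum '{}' '+'
}

# Print each file below $1 whose content matches no file below $2.
# File names are ignored; only the MD5 checksum of the content is compared.
# Assumption: names contain no newlines or backslashes (md5sum escapes such lines).
only_in() {
    {
        hash_tree "$2" | sed 's/^/B /'   # reference tree, tagged B, is emitted first
        hash_tree "$1" | sed 's/^/A /'   # candidate tree, tagged A, follows
    } | awk '
        $1 == "B"     { seen[$2] = 1; next }      # remember every checksum found in the reference tree
        !($2 in seen) { print substr($0, 37) }    # tag (2) + checksum (32) + separator (2), path starts at 37
    '
}

# Example calls, using the folder names from the question:
#   only_in Taxes Vendors    # tax copies whose content is nowhere in the vendor tree
#   only_in Vendors Taxes    # vendor documents that have no copy in Taxes
```

Running it in both directions gives the two lists the question asks about. Note that a re-scanned or re-saved PDF has a different checksum even when it looks identical, so anything reported as missing is worth a visual check.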

2

I think the i-net PDF content comparer would be helpful.

It is now at version 2.0, offering a GUI and flexible pricing options. There is still a free 30-day trial version in which you can check every aspect of the software.


1
  • 1
    Looked do-able until I saw the price: 1295 US$. And the terms of the free trial make it unusable since I'm not a developer. Commented Mar 18, 2012 at 23:21
0
  1. You can (must, really) use the xdocdiff plugin for WinMerge if you compare the content by eye
  2. CompareIt! can render PDF files (so-so) and show them in its comparison windows without additional plugins
  3. DiffPDF compares and displays the files even better (see the screenshot on its page) and is cross-platform

As an alternative solution, you can store a plain-text copy of each PDF under the same name (converted with, e.g., pdftotext; pandoc does not read PDF input) and compare only the text versions with any tool

0

I just did this. Here is what I used; it worked swell and it was simple!

http://www.qtrac.eu/diffpdf.html

0

Try the app "PDF Compare", which compares both PDF document metadata and page images at the pixel level:

https://www.microsoft.com/en-us/store/p/pdfcompare/9n9dmzjbz2nl#
