How can I compare the contents of .pdf files, excluding filenames from comparison?

Question

I usually use WinMerge to view the differences between files, but in this case it doesn't help. The files I'm comparing are known to have different filenames, which creates false positives whenever two files containing the same document are named differently.

I have a folder containing many directories, one for each vendor my company does business with, and they hold many .pdf files of receipts & invoices. It's the master vendor list. The invoices & receipts are named such that the names don't make sense without the surrounding directory structure to provide context. For example, here we have "Vendors/Company Foo/Product Bar/Invoice#3.pdf".

Then I have another folder with many receipts & invoices in it. It used to be maintained separately from the master vendor list, and was supposed to hold a manually created copy of every receipt & invoice that was filed under the appropriate entry in the master vendor directory structure. These copies were to have been renamed so that the accountant can easily read them & know what they refer to. For example, here we have "Taxes/CompanyFoo ProductBar.pdf".

I searched for .pdf files from the top-level folder of the master vendor list, so that the search results include receipts & invoices from all the vendors in the directory structure. I then copied these .pdf files to another folder on my Desktop and compared them with the files in the 'taxes' folder using WinMerge, to see whether any file in the 'taxes' folder is missing from the 'master vendor' directories, and vice versa.

But WinMerge counts files as different just because their filenames don't match. I need to know whether the file content differs, regardless of the filename.

There are hundreds of these files, & if any file in the 'taxes' folder is missing from its corresponding 'master vendor' directory, I need to rectify that & file it correctly.

Can someone recommend a tool that can do this?

Why don't you use md5sum recursively? Two PDF files with the same checksum and the same file size have an extremely low chance of being different. — Benoit, Commented Mar 18, 2012 at 19:35
possible duplicate of Which duplicate files and folders finders exist for Windows? — Daniel Beck, Commented Mar 18, 2012 at 19:37
I found something in that thread that does what I need; in fact, it was that thread's answer. Thanks, Daniel Beck! I don't know how to make that the answer to this one, however. — cdvonstinkpot, Commented Mar 18, 2012 at 23:43

Community · Accepted Answer · 2020-06-12 13:48:39Z

If you have some kind of Unix environment available (on Windows, I suggest Cygwin), you can easily find duplicate files below the current directory with something like this:

find . -type f -exec md5sum '{}' '+' | sort | uniq -D -w 32

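The output is the md5sum and name of every file that has at least one duplicate (same md5sum). Because the lines are sorted by checksum, duplicates show up directly after one another. The -w 32 option makes uniq compare only the first 32 characters of each line, which is the length of an MD5 checksum in hexadecimal. Replace the . after find with the path you want to search if it's not the current directory.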

Edit:

Conversely, to get the files that have no duplicates, you can use

find . -type f -exec md5sum '{}' '+' | sort | uniq -u -w 32

That will only print files without any duplicate below the current directory.
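
The two commands above look for duplicates within a single tree. For the situation in the question, where two separate folders have to be checked against each other, the same md5sum idea could be extended roughly as follows. This is only a sketch: hash_tree and only_in_first are made-up helper names, and it assumes find, md5sum, sed, and awk are available (as under Cygwin or most Linux systems).

```shell
# hash_tree DIR: print "<md5>  <path>" for every regular file below DIR.
hash_tree() {
    find "$1" -type f -exec md5sum '{}' '+'
}

# only_in_first DIR_A DIR_B (hypothetical helper): print the md5sum lines of
# files below DIR_A whose content matches no file below DIR_B.
only_in_first() {
    {
        # Tag each line with its origin so awk can tell the two trees apart.
        hash_tree "$2" | sed 's/^/B /'
        hash_tree "$1" | sed 's/^/A /'
    } | awk '
        # Remember every checksum that occurs in DIR_B.
        $1 == "B" { seen[$2] = 1; next }
        # Print DIR_A lines whose checksum was never seen, without the tag.
        !($2 in seen) { sub(/^A /, ""); print }
    '
}
```

Running only_in_first Taxes Vendors would list the files under Taxes whose content matches nothing under Vendors; swapping the arguments checks the other direction. Like the one-liners, this only detects byte-identical copies, so a PDF that was re-saved or re-exported will be reported as unmatched.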

slhck · Accepted Answer · 2012-10-05 07:46:00Z

I think the i-net PDF content comparer would be helpful.

It is now at version 2.0, offering a GUI and flexible pricing options. There is still a free 30-day trial version with which you can check every aspect of the software.

answered Mar 18, 2012 at 19:36 by Hamed; edited Oct 5, 2012 at 7:46 by slhck

Looked doable until I saw the price: US$1,295. And the terms of the free trial make it unusable, since I'm not a developer. – cdvonstinkpot, Commented Mar 18, 2012 at 23:21

Lazy Badger · Accepted Answer · 2012-03-19 02:28:59Z

If you compare the content by eye, you can (really, should) use the xdocdiff plugin for WinMerge.
CompareIt! can render PDF files (passably) and show them in its comparison windows without additional plugins.
DiffPDF compares and displays the files even better (see the screenshot on its page) and is cross-platform.

As an alternative, you could store a plain-text copy of each PDF under the same name (converted with, e.g., pdftotext; pandoc can write PDF but cannot read it) and compare only the text versions with any tool.
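
If two copies are not byte-for-byte identical (for instance because one PDF was re-saved), comparing the extracted text rather than the raw file may still match them. A rough sketch of that idea, assuming Poppler's pdftotext is installed; text_md5, same_text, and the EXTRACT override are hypothetical names used only for illustration:

```shell
# text_md5 FILE: print the MD5 checksum of the text extracted from FILE.
# Assumption: pdftotext is on the PATH; "-" as the output name sends the text
# to standard output. EXTRACT is a hypothetical hook for another extractor
# that is called the same way.
text_md5() {
    text=$("${EXTRACT:-pdftotext}" "$1" -) || return 1
    printf '%s' "$text" | md5sum | cut -d ' ' -f 1
}

# same_text FILE_A FILE_B: succeed when both files yield identical text,
# fail with status 1 when the text differs, and with 2 when extraction fails.
same_text() {
    sum_a=$(text_md5 "$1") || return 2
    sum_b=$(text_md5 "$2") || return 2
    [ "$sum_a" = "$sum_b" ]
}
```

For example, same_text "Vendors/Company Foo/Product Bar/Invoice#3.pdf" "Taxes/CompanyFoo ProductBar.pdf" && echo match would report a match when both files contain the same text. Scanned invoices without a text layer yield little or no text, so this approach suits only PDFs that actually contain text.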

Micah Armantrout · Accepted Answer · 2012-03-19 02:58:56Z

I just did this. Here is what I used; it worked well and it was simple!

http://www.qtrac.eu/diffpdf.html

rick · Accepted Answer · 2018-03-06 22:07:44Z

Try the app "PDF Compare", which compares both PDF document metadata and page images at the pixel level:

https://www.microsoft.com/en-us/store/p/pdfcompare/9n9dmzjbz2nl#
