This is the reference implementation for the PDQ hashing algorithm, FB-ified (buck build, etc.).
Please see https://github.com/facebookexternal/ThreatExchange-PDQ/
pdq-photo-hasher.cpp
-- Runs the PDQ hashing algorithm and produces hashes+metadata+timings. Usemlr
(https://github.com/johnkerl/miller) to slice & dice the output if you like.hashtool256.cpp
-- Various operations on 256-bit hashes for operational use. Seereg_test/run
for sample invocations andreg_test/expected/out
for sample output.clusterize256.cpp
-- Given a list of hashes+metadata, clusterizes it against itself.mih-query.cpp
-- Given two lists of hashes+metadata ('needles' and 'haystack'), looks up each of the former in the latter.
These are all Ruby scripts.
htmlify-photos.rb
-- Reads the output ofpdq-photo-hasher
and writes simple HTML for visualizing.htmlify-photos-plain.rb
-- Same but with no text, only images. For the "Definition and evaluation of quality metric" section of the PDQ hashing document, I didpdq-photo-hasher ...
to a file, then piped that tomlr sort -n quality
, then piped that tohtmlify-photos-plain.rb
and screenshotted that.htmlify-pairs.rb
-- Same but pairs up images from two files. Used for the part of the PDQ hashing document which pairs up video frames with/without watermarks.htmlify-clusters.rb
-- Reads the output ofclusterize256
and writes simple HTML for visualizing.htmlify-clusters-plain.rb
-- Same but with no text, only images. See for example the "Clustering results" section of the PDQ hashing document.
- Test run of the hasher:
$ ./pdq-photo-hasher reg_test/input/dih/bridge-*jpg
f8f8f0cce0f4e84d0e370a22028f67f0b36e2ed596623e1d33e6339c4e9c9b22,reg_test/input/dih/bridge-1-original.jpg
b0a10efd71cc3f429413d48d0ffffe12e34e0e17ada952a9d29684210aa9e5af,reg_test/input/dih/bridge-2-rotate-90.jpg
adad5a64b5a142e55362a09057dacd5ae63b847fc23794b766b319361fc93188,reg_test/input/dih/bridge-3-rotate-180.jpg
a5f4a457a48995e8c9065c275aaa5498b61ba4bdf8fcf80387c32f8b0bfc4f05,reg_test/input/dih/bridge-4-rotate-270.jpg
f8f80f31e0f417b00e37f5cd028f980fb36ed02a9662c1e233e6cc634e9c64dd,reg_test/input/dih/bridge-5-flipx.jpg
8dad2599b1a1bd1853625f6553da32a1e63b7280c2374b4866b366c91bc9ce77,reg_test/input/dih/bridge-6-flipy.jpg
f0a1f102f1dcc0bd9c5309720fff018de34ef1e8ada9a956d2967ade0ea91a50,reg_test/input/dih/bridge-7-flip-plus-1.jpg
a5f05ba8a4896a17c106a3da5aaaab07b61b5b42f8fc07fc83c3d0740bfcb0fa,reg_test/input/dih/bridge-8-flip-minus-1.jpg
- Get some image files, some matching the above and some not.
pdq-photo-hasher --details /path/to/your/*.jpg > haystack.hsh
.clusterize256 haystack.hsh > haystack.clu
and verify that similars are clustered together.- Run
htmlify-clusters.rb < haystack.clu > haystack.html
and open that file in your browser to take a look at the images. - Get some more image files.
pdq-photo-hasher --details /path/to/more/*.jpg > needles.hsh
.mih-query needles.hsh haystack.hsh
and verify that matches are found as desired.