
I have a lot of duplicate image files on my Windows computer, in different subfolders and with different file names.

What Python script or freeware program would you recommend for removing the duplicates?

(I've read this similar question, but the poster there is asking about visual duplicates with differing file sizes. Mine are exact duplicates with different file names.)

  • Keep in mind that even if all the pixels are the same, the files may still have different EXIF information (modified by programs that handled the images at some stage), which will pose problems for most of the currently proposed solutions.
    – user12889
    Commented Jun 15, 2010 at 3:28

7 Answers


Don't rely on MD5 sums.

MD5 sums are not a reliable way to confirm duplicates; they are only a reliable way to confirm differences.

Use MD5s to find possible candidate duplicates, and then for each pair sharing an MD5:

  1. Open both files.
  2. Seek forward in both files until one byte differs.
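In Python, that pairwise check might look like the following minimal sketch (the function name `files_identical` is mine, not from any library):

```python
def files_identical(path_a, path_b, chunk_size=65536):
    """Compare two files byte-by-byte, stopping at the first difference."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                return False   # first differing chunk found (or lengths differ)
            if not block_a:
                return True    # both files exhausted with no difference
```

Reading in chunks keeps memory flat even for large images, and the comparison stops at the first chunk that differs instead of reading both files to the end.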

Since I'm getting downvoted by people taking naïve approaches to file-duplicate identity: if you're going to rely entirely on a hash algorithm, for goodness' sake use something tougher like SHA-256 or SHA-512. At least you'll reduce the collision probability to a reasonable degree by having more bits checked; MD5 is exceedingly weak against collisions.

I also advise reading the mailing-list thread titled 'file check' here: http://london.pm.org/pipermail/london.pm/Week-of-Mon-20080714/thread.html

If you say "MD5 can uniquely identify all files," then you have a logic error.

Given a range of values of varying lengths, from 40,000 bytes to 100,000,000,000 bytes, the total number of combinations in that range greatly exceeds the number of values MD5 can represent, which weighs in at a mere 128 bits.

Represent 2^100,000,000,000 combinations with only 2^128 values? I don't think that's likely.
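For scale: the pigeonhole argument above says collisions must exist, while the standard birthday approximation estimates how likely an accidental one is for a given number of files. A quick sketch (the function name is mine; note this says nothing about deliberately constructed MD5 collisions, which are a separate and much easier attack):

```python
import math

def birthday_collision_probability(n_files, hash_bits):
    """Approximate P(at least one collision) among n uniformly random hashes,
    using the birthday bound 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_files * (n_files - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(exponent)  # numerically stable 1 - e^x for tiny x

# Even a million files gives a vanishingly small chance of an *accidental*
# 128-bit collision (roughly 1.5e-27); deliberate MD5 collisions are the
# real-world danger, which is why a byte-for-byte confirmation still matters.
p = birthday_collision_probability(10**6, 128)
```

So the disagreement in the comments below is really about accidental versus adversarial collisions: accidental ones are astronomically unlikely, but testing for them costs almost nothing once candidates are already grouped.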

The Least Naïve way

The least naïve way, and the fastest way, to weed out duplicates is as follows.

  1. By size: files of different sizes cannot be identical. This takes little time, since it does not even have to open the file.
  2. By hash: files with different MD5/SHA values cannot be identical. This takes a little longer because it has to read every byte in the file and perform math on it, but it makes multiple comparisons quicker.
  3. Failing the above differences: perform a byte-by-byte comparison of the files. This is a slow test to execute, which is why it is left until all the other eliminating factors have been considered.

fdupes does this, and you should use software that applies the same criteria.
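The three stages above can be sketched in Python like this (illustrative only; fdupes' actual implementation differs, and the function name is mine):

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Three-stage duplicate finder: size -> SHA-256 -> byte-by-byte."""
    # Stage 1: group by size (cheap, never opens a file)
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size: cannot have a duplicate

        # Stage 2: within same-size groups, group by SHA-256
        by_hash = defaultdict(list)
        for path in paths:
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)

        # Stage 3: confirm byte-by-byte before declaring a duplicate
        for group in by_hash.values():
            original = group[0]
            for candidate in group[1:]:
                if filecmp.cmp(original, candidate, shallow=False):
                    duplicates.append(candidate)
    return duplicates
```

Most files drop out at stage 1 without being read at all; stage 3 only ever runs on the handful of files that share both a size and a hash, so its cost is negligible in practice.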

  • It is literally more likely that your hard drive will magically destroy an image than that MD5 will collide. "Represent 2^100,000,000,000 combinations with only 2^128 combinations" - I agree with you here. If he had 2^100,000,000,000 pictures, MD5 (or almost any hash algorithm) would be bad.
    – Greg Dean
    Commented Jan 2, 2009 at 5:57
  • There is no guarantee; it's just unlikely. It's not impossible. It's quite possible to have 10 files that all collide with each other but are all entirely different. This is unlikely, but it can happen, so you must test for it. Commented Jan 2, 2009 at 10:25
  • File size, then MD5, and only then a byte-for-byte check. Commented Jan 4, 2009 at 4:19
  • @Kent - I agree with you 100%. It is laziness to disregard something because it is very unlikely, even as unlikely as we are talking about. I'd be annoyed if some of my data were destroyed just because the person who wrote the program thought that something was too unlikely to bother coding for.
    – Joe Taylor
    Commented Nov 22, 2010 at 16:06

It's a one-liner on Unix-like OSes (including Linux), or on Windows with Cygwin installed:

find . -type f -print0 | xargs -0 shasum | sort |
  perl -ne 'chomp; $sig = substr($_, 0, 40); $file = substr($_, 42);
    unlink $file if $sig eq $prev; $prev = $sig'

md5sum (which is about 50% faster) can be used if you know there are no deliberately created collisions (you'd have a better chance of winning 10 major lotteries than of finding one naturally occurring MD5 collision).

If you want to see all the dups you have instead of removing them, just change the unlink $file part to print $file, "\n".
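Since the question asked for Python, a rough portable equivalent of the pipeline above, listing duplicates instead of deleting them, might look like this sketch (the function name is mine; swap the returned paths into os.remove() to actually delete):

```python
import hashlib
import os

def duplicate_files(root):
    """Return paths whose contents duplicate an earlier-seen file,
    keyed by SHA-1 (mirrors the shell pipeline: first copy kept)."""
    seen = {}   # hex digest -> first path seen with that content
    dups = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            if digest in seen:
                dups.append(path)  # duplicate of seen[digest]
            else:
                seen[digest] = path
    return dups
```

Note this reads each whole file into memory, which is fine for typical photos; for very large files, hash in chunks instead. Which copy counts as "first" depends on directory traversal order, just as the sort order does in the shell version.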

  • You can use -print0 and xargs -0 to handle spaces, but find also has an -exec option that's useful here: find . -type f -exec shasum {} \; | sort ... Also: you shouldn't use @F (-a) because it won't work with spaces. Try substr instead.
    – geocar
    Commented Jan 2, 2009 at 3:42
  • Good call, geocar. Updated the answer with your suggestions.
    – obecalp
    Commented Jan 2, 2009 at 3:58
  • "md5sum (which is about 50% faster) can be used if you know there is no deliberately created collisions" - exactly
    – Greg Dean
    Commented Jan 2, 2009 at 6:34

I've used fdupes (written in C) and freedups (Perl) on Unix systems, and they might work on Windows as well; there are also similar tools that are claimed to work on Windows: dupmerge, liten (written in Python), etc.

  • Perl and Python software should work identically on Windows and *nix systems, assuming details of the filesystem don't matter.
    – CarlF
    Commented Nov 3, 2010 at 12:51

To remove duplicate images on Windows take a look at DupliFinder. It can compare pictures by a variety of criteria such as name, size, and actual image information.

For other tools to remove duplicate files take a look at this Lifehacker article.


One option is DupKiller.

DupKiller is one of the fastest and most powerful tools for searching for and removing duplicate or similar files on your computer. The sophisticated algorithms built into its search mechanism deliver fast results, and a wide range of options lets you customize the search flexibly.



A PowerShell way to scan for duplicate images (.jpg, .png, .gif, .jpeg, .webp, .tiff, .psd, .raw, .bmp, .heif, .indd, .svg formats supported):

  • Check by SHA256 Hash
  • GUI dialog to choose files to delete and drives to scan
  • Please wait dialog
  • Hides the PowerShell console
  • Can be very slow
$sig=@'
public static void ShowConsoleWindow(int state)
{
  var handle = GetConsoleWindow();
  ShowWindow(handle,state);
}
[System.Runtime.InteropServices.DllImport("kernel32.dll")]
static extern IntPtr GetConsoleWindow();
[System.Runtime.InteropServices.DllImport("user32.dll")]
static extern bool ShowWindow(IntPtr hWnd, int nCmdShow);
'@
$hc=Add-Type -mem $sig -name Hide -Names HideConsole -Ref System.Runtime.InteropServices -Pas
$hc::ShowConsoleWindow(0)
[console]::title="Duplicate Image Scanner (c) Wasif Hasan | Sep 2020"
$eXt=@('.jpg','.png','.gif','.jpeg','.webp','.tiff','.psd','.raw','.bmp','.heif','.indd','.svg')
@('system.windows.forms','system.drawing')|%{add-type -as $_}
$s=[windows.forms.form]::new();$s.size=[drawing.size]::new(400,850);$s.StartPosition="CenterScreen";$s.Text="Select drives to scan"
$drives=gdr -p "FileSystem"|select -eXp name
$top=20;$left=50;$drives|%{
$c=$_.split(" ")-join"_";$top += 20
iex "`$$($c) = New-Object System.Windows.Forms.CheckBox;`$$($c).Top = $($top);`$$($c).Left = $($left);`$$($c).Anchor='Left,Top';`$$($c).Parent='';`$$($c).Text='$($_)';`$$($c).Autosize=`$true;if('$_' -in `$drives){`$$c.Checked=`$true};`$s.Controls.Add(`$$c)"
}
$ok=New-Object System.Windows.Forms.Button;$ok.Text='OK';$ok.Top=770;$ok.Left=290
$ok.add_click({$s.Close()});$s.Controls.AddRange($ok)
$sa=New-Object System.Windows.Forms.Button;$sa.Text='Select All';$sa.Top=770;$sa.Left=200
$sa.add_click({$s.Controls|?{($_.Checked) -or !($_.Checked)}|%{try{$_.Checked=$True}catch{}}});$s.Controls.AddRange($sa)
$null=$s.ShowDialog()
$choices=$s.Controls|?{$_.Checked}|select -eXp Text
$i=0;$choices|%{$choices[$i]=$_+':\';$i++}
$f=[windows.forms.form]::new();$f.Size=[drawing.size]::new(600,100);$f.StartPosition="CenterScreen";$f.Text="Please wait"
$l=[windows.forms.label]::new();$l.Text="Please wait until the scan is complete........";$l.Font="Segoe UI,16";$l.AutoSize=$true;$f.Controls.AddRange($l)
$f.Show() # non-modal; ShowDialog() would block the script until the dialog is closed
$files=@();$hCols=@();$choices|%{
 dir $_ -r|?{$_.eXtension-in$eXt}|%{
   $h=get-filehash $_.fullname -a 'SHA256'|select -eXp hash
   if($h-in$hCols){$files+=$_.fullName}else{$hCols+=$h}
}};$f.Close()
$del=$files|ogv -t "Duplicate images (Hold CTRL and select the ones to delete)" -p
$del|%{rm "$_" -fo}
[windows.forms.messagebox]::Show("Thanks for using!","Duplicate image scanner","OK","Information")

Instead of DupliFinder, try the forked project, DeadRinger. We've fixed a ton of bugs in the original project, added a bunch of new features, and dramatically improved performance.
