Removal of duplicate PDF files based on revision level in file naming system

Question

I'm trying to create a batch file, PowerShell script, or anything else a novice like myself could run easily to complete the following task. Any help would be greatly appreciated.

I have a few thousand PDFs in a folder that I'm trying to sort through. The problem is that the folder includes old and new revisions of the same PDF documents. I only want to keep the newest revision of each unique document. Revised versions are indicated by the addition of a letter at the end of the filename (A-Z). Here is a sample list.

670BA-11-001.pdf
670BA-11-001A.pdf
670BA-11-001B.pdf
670BA-12-001.pdf
670BA-15-030C.pdf
670BA-49-120AC.pdf
670BA-49-120AD.pdf

All files start with "670BA"
The numbers that follow vary: 670BA-XX-XXX.pdf
A file with no letter at the end of the filename indicates that it is the original revision
A file with a letter at the end of the filename indicates it is a revised version.
Revisions go from A-Z and then AA-AZ... so on and so forth.

Ideally I'd like the batch file to delete the older versions and leave the newest version of each unique document. In this case the output should look like:

670BA-11-001B.pdf
670BA-12-001.pdf
670BA-15-030C.pdf
670BA-49-120AD.pdf

I was provided the following code, however I believe it is written for Unix (again, forgive my lack of knowledge here). Would this work if I could convert it to a Windows command?

codes=`ls | sort | cut -d'-' -f2 | uniq`
for f in $codes; do old=`ls *-$f-* | head -n -1`; rm -vf $old; done

Here's what's going on:

ls | sort lists all the files in lexical order
cut -d'-' -f2 | uniq

splits the filenames on '-', grabs the 2 digit number from the middle, and gets rid of duplicates.

ls *-$f-* | head -n -1

lists all the files for a 2 digit code, except for the last one - which is the newest.

rm -vf $old

deletes those old files; the -v prints each file as it is removed, and the -f keeps it from failing if the list is empty.

SAMPLE RUN:

/tmp# touch 601R-11-001.pdf   601R-11-001B.pdf  601R-15-030C.pdf  601R-25-005E.pdf   601R-49-120AD.pdf  601R-11-001A.pdf  601R-12-001.pdf   601R-25-005D.pdf  601R-49-120AC.pdf

/tmp# codes=`ls | sort | cut -d'-' -f2 | uniq`

/tmp# echo $codes
11 12 15 25 49

/tmp# for f in $codes; do old=`ls *-$f-* | head -n -1`; rm -vf $old; done

removed '601R-11-001.pdf'
removed '601R-11-001A.pdf'
removed '601R-25-005D.pdf'
removed '601R-49-120AC.pdf'

It's too bad the files' Date Modified attribute can't simply be used instead of the naming convention, as that would likely make this easier to do with a batch script. The issue I see in simple testing with both PowerShell and batch is that, given names like 670B-11-001B.pdf and 670B-11-001AA.pdf, the sort does not account for the second letter, so B comes after AA. There's probably a way to break down the characters and then sort, but I wanted to ask: would going by Date Modified work instead? — Vomit IT - Chunky Mess Style, Commented Mar 13, 2018 at 19:59
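The ordering problem described in this comment can be reproduced with plain shell tools. The following is a minimal sketch, assuming revision suffixes are empty or made of uppercase letters only; `sort_revisions` is a hypothetical helper name, not something from the thread. It sorts by suffix length first and alphabetically second, which is one way to get the A-Z, AA-AZ order.

```sh
#!/bin/sh
# Order revision suffixes oldest to newest: shorter suffixes first,
# then alphabetically within the same length ("" < A < ... < Z < AA < AB).
# Reads one suffix per line; an empty line stands for "no revision letter".
sort_revisions() {
    tab=$(printf '\t')
    awk '{ print length($0) "\t" $0 }' |      # prefix each suffix with its length
        LC_ALL=C sort -t "$tab" -k1,1n -k2,2 | # length numerically, then letters
        cut -f2-                                # drop the length column again
}

# Plain lexical sort: B ends up last, after AA and AC.
printf '%s\n' AA B "" AC A | LC_ALL=C sort

# Length-aware sort: AC ends up last, so it is treated as the newest.
printf '%s\n' AA B "" AC A | sort_revisions
```

The length-first rule is the same idea the accepted answer uses when it groups revision IDs by length before sorting them.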

Ben N · Accepted Answer · 2018-03-14 16:10:55Z

If you have working Bash code (I haven't tested the script in your post), you can run it on Windows by installing Ubuntu on the Windows Subsystem for Linux. Once you have Ubuntu set up, you can open a Bash prompt using the Bash on Ubuntu on Windows item in the Start menu (if present) or by typing bash in the Run box. The Windows C:\ structure is at /mnt/c/ in the Bash environment.

Alternatively, you can use PowerShell!

$revPos = '670BA-XX-XXX'.Length
dir '670BA*.pdf' | group @{e={ $_.Name.Substring(0, $revPos) }} | % {
    $revs = $_.Group | % { $_.Name.Substring($revPos).Split('.')[0] } | group Length | sort -Descending -Property @{e={ [int]$_.Name }} | % { $_.Group | sort -Descending }
    $fileSet = $_.Name
    $revs | % { $fileSet + $_ + '.pdf' } | select -Skip 1 | del
}

Let's break it down by line and pipeline component:

1. For convenience, store the length of the part that identifies the document, i.e. the index of the revision. This assumes that the document identifiers are always the same size.
2. Get all file sets.
   - Get all files in the current directory that start with 670BA and are .pdfs.
   - Group them by the first part of the name, the document identifier. The business with the @{e={ is a custom property.
   - Iterate over the groups.
3. Get a sorted list of revision IDs for the current group.
   - The Group property is on the output objects of the group command.
   - For every file object included in the group, select the part of its name after the document identifier but before the period in .pdf. This is the revision identifier. If a file is not revised, this will be a zero-length string.
   - Group the revision IDs by length.
   - Sort the group objects (not the items in them) by the length of their member strings, longest first. The Name property of the group holds the value of the property that was used to group the objects.
   - For each of those group objects, sort their members in reverse alphabetical order. This collapses all the groups together into the $revs variable, sorted newest-first according to your versioning system.
4. Store the Name value of the file group in a different variable to keep it accessible, since other for-eaches (%) will shadow the $_ variable.
5. Delete all but the newest revision in the document group.
   - Use the entries in the $revs list.
   - Recompose the full filename for each revision identifier. $_ now holds revision identifiers from $revs.
   - Skip the first entry since it's the newest one, the one we want to keep.
   - Delete the files corresponding to all entries remaining in the pipeline. If you want to test the script without deleting anything, add a space and the -WhatIf switch at the end of this line. In what-if mode, del will just print what it would have done.
6. End the document group iteration.

To use the script, save it as a .ps1 file, e.g. revnewest.ps1. If you haven't already, follow the instructions in the Enabling Scripts section of the PowerShell tag wiki. Then you can put it in your document folder, open PowerShell there, and run it like this:

.\revnewest.ps1
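For anyone taking the WSL route instead, the same keep-the-newest rule can be sketched with shell and awk. This is a rough sketch under stated assumptions, not a tested drop-in replacement for the PowerShell script: it assumes every name looks like 670BA-XX-XXX followed by an optional run of uppercase revision letters and .pdf, and that the document identifier is always 12 characters. `list_old_revisions` and `prune_revisions` are hypothetical helper names.

```sh
#!/bin/sh
# Print the names of all files in directory $1 that are NOT the newest
# revision of their document. Assumes a fixed 12-character document ID.
list_old_revisions() {
    for f in "$1"/670BA*.pdf; do
        [ -e "$f" ] || continue          # the glob matched nothing
        printf '%s\n' "${f##*/}"         # file name without the directory
    done | awk -v idlen=12 '
        # A revision is newer if it is longer, or equally long and later in the alphabet.
        function newer(a, b) {
            return (length(a) > length(b)) || (length(a) == length(b) && a > b)
        }
        /^670BA.*\.pdf$/ {
            id  = substr($0, 1, idlen)       # e.g. 670BA-11-001
            rev = substr($0, idlen + 1)      # e.g. AA.pdf
            sub(/\.pdf$/, "", rev)           # e.g. AA (empty for the original)
            names[NR] = $0; ids[NR] = id; revs[NR] = rev
            if (!(id in best) || newer(rev, best[id])) best[id] = rev
        }
        END {
            for (i = 1; i <= NR; i++)
                if ((i in names) && revs[i] != best[ids[i]]) print names[i]
        }'
}

# Dry run by default; pass "delete" as the second argument to remove files.
prune_revisions() {
    list_old_revisions "$1" | while IFS= read -r name; do
        if [ "$2" = delete ]; then
            rm -- "$1/$name"
        else
            echo "would remove $name"
        fi
    done
}
```

Calling `prune_revisions` with just the folder path only prints what would be removed, much like -WhatIf in the PowerShell version; adding `delete` as a second argument actually removes the older revisions.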

Thank you Ben N, this is working flawlessly and is going to make life much easier!! I have to go through several hundred of these pdfs monthly, now it won't take more than a few seconds. — Rosco, Commented Mar 15, 2018 at 19:01
