
I need to process more than 50,000 files with a third-party .exe command-line application. The application accepts only one input file at a time, so I have to launch it more than 50,000 times.

Each file (each job) usually takes about one second. However, sometimes the application hangs indefinitely.

I have written a Windows shell script that runs all the jobs serially and checks every second whether the current job is done; after 10 seconds it kills the job and moves on to the next one. However, the full run takes about 20 hours. I believe I could cut the total runtime substantially by running multiple jobs in parallel. The question is how?

In CMD I launch each task with Start, but there is no simple way to recover the process ID (PID), so I cannot easily keep track of how long each instance has been running. I feel like I am reinventing the wheel. Any suggestions?
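
To make the timeout logic concrete, here is a rough PowerShell sketch of what the serial loop does (illustration only; app.exe and the .\input folder are placeholders for the real application and input directory):

# Serial sketch: run one file at a time, kill anything that exceeds 10 seconds
Get-ChildItem ".\input" -File | ForEach-Object {
    # -PassThru returns the process object, i.e. the PID that Start in CMD does not expose easily
    $p = Start-Process -FilePath ".\app.exe" -ArgumentList $_.FullName -PassThru -NoNewWindow
    if (-not $p.WaitForExit(10000)) {   # wait up to 10 seconds (10,000 ms)
        Stop-Process -Id $p.Id -Force   # still running: kill it and move on
    }
}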

  • Questions seeking product, service, or learning material recommendations are off-topic because they become outdated quickly and attract opinion-based answers. Instead, describe your situation and the specific problem you're trying to solve. Share your research.
    – Xavierjazz
    Commented Aug 14, 2017 at 2:14
  • I have described my problem in detail in the post title and the first two paragraphs. The third paragraph talks about what I did. I changed the fourth paragraph but I don't know that the question is better now.
    – Mattia
    Commented Aug 14, 2017 at 2:20

2 Answers


PowerShell is your friend.

https://serverfault.com/questions/626711/how-do-i-run-my-powershell-scripts-in-parallel-without-using-jobs asks something similar.

"Quick" and "robust" are of course subjective.

  • Thanks, PowerShell is what I needed. I will add an answer below with the exact code I used, which I think is very reusable. I used the "Invoke-Parallel" tool mentioned in the answer you pointed to.
    – Mattia
    Commented Aug 14, 2017 at 20:05
  • I also removed "quick" and "robust" from the title. Thx
    – Mattia
    Commented Aug 14, 2017 at 20:16

PowerShell did the trick, as indicated in quadruplebucky's answer. Here is the code I used. The second-to-last line (the ./xml2csv call) is the job itself; the rest of the script can be reused for similar tasks.

# PARAMETERS
$root = 'D:\Ratings'
$folder = 'SP'

# Import Invoke-Parallel
. ".\Invoke-Parallel.ps1"

# Run in parallel
Get-ChildItem ".\$folder-xml" -Filter *.xml |
Invoke-Parallel -throttle 10 -runspaceTimeout 10 -ImportVariables `
  -ScriptBlock {
    $file = $_.BaseName
    Write-Output $file
    Set-Location $root
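    # The job itself: convert one XML file to CSV (replace this line for other tasks)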
    (./xml2csv "$folder-xml\$file.xml" "$folder-csv\$file.csv" "fields-$folder.txt" -Q) | Out-Null
  }

Some notes:

  • The Invoke-Parallel function (used like a cmdlet) is the tool mentioned in the answer quadruplebucky linked to; download it and dot-source it as shown at the top of the script.
  • A runspace is what I would have called an "instance". -runspaceTimeout sets the maximum running time, in seconds, for each instance.
  • -throttle sets the maximum number of instances running at the same time (see the short demo after these notes).
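
As a quick illustration of those two parameters, here is a toy example of mine (not part of the script above): with -throttle 2 at most two runspaces execute at once, and -runspaceTimeout 3 terminates any job still running after 3 seconds, which is exactly how the hanging xml2csv instances are cleaned up.

# Toy demo only: each "job" sleeps a random 1-6 seconds, so some exceed the 3-second timeout and get terminated
. ".\Invoke-Parallel.ps1"
1..6 | Invoke-Parallel -throttle 2 -runspaceTimeout 3 -ScriptBlock {
    Start-Sleep -Seconds (Get-Random -Minimum 1 -Maximum 7)   # random 1..6 second "job"
    "job $_ finished"
}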
