At my workplace, we have a high-performance computing cluster managed by SLURM.

Some people's jobs spawn lots of processes which, as a federation, are one job. They also write the top-level controller poorly, so SIGINT leaves the child processes behind as orphans that keep running.

Because of the nature of this environment, it's not reasonable (for no actually valid reason) to expect them to fix this.

So I'm trying to make a submission wrapper which, at the end of the job, will kill all the child processes.

By default, ps lists the processes associated with the current tty session. Under SLURM, however, that list isn't just the processes for THIS task of this job but also those of OTHER tasks, so a wrapper that kills everything ps reports takes out everything on a physical node whenever one job dies.

So, how do I get/kill all the processes which are children of the current bash script?

1 Answer

The PID of the current process (the script's process or the shell's, whichever the case may be) is stored in $$. Using that PID, you can run the pgrep command with the -P flag:

pgrep -P $$

to get a list of all PIDs whose parent PID is $$.

Here is a super simple proof-of-concept script:

#!/bin/bash

curpid="$$"
#launch 2 useless child processes
cat /dev/random > /dev/null &
cat /dev/random > /dev/null &
cpid=`pgrep -P $curpid`  && echo "$(basename $0) pid: $curpid; child pids:" $cpid
#kill the child pids
kill $cpid
# check if any child pids still exist
newcpid=`pgrep -P $curpid`
if [ $? -ne 0 ] || [ -z "$newcpid" ]; then
  echo "no child pids left..."
else
  echo $newcpid
fi

Output:

[{0} 02:34:57] $ ./test3.sh 
test3.sh pid: 7015; child pids: 7016 7017
./test3.sh: line 12:  7016 Terminated              cat /dev/random > /dev/null
./test3.sh: line 12:  7017 Terminated              cat /dev/random > /dev/null
no child pids left...
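
For the wrapper described in the question, one way to apply the same pgrep -P $$ lookup is to run it from an EXIT trap, so the cleanup happens whenever the wrapper script finishes. This is only a minimal sketch, assuming bash and the procps pgrep; cleanup_children is a made-up helper name and the sleep commands are stand-ins for whatever the real job launches:

```shell
#!/bin/bash
# Hypothetical wrapper sketch: kill this script's direct children on exit.

cleanup_children() {
  local pids
  # pgrep -P lists processes whose parent PID is this script's PID.
  pids=$(pgrep -P "$$")
  if [ -n "$pids" ]; then
    # Unquoted on purpose: one argument per PID.
    # Errors are hidden because a listed PID may already have exited
    # or may belong to another user.
    kill $pids 2>/dev/null
  fi
  return 0
}
trap cleanup_children EXIT

# Stand-ins for the real job's background processes.
sleep 60 &
sleep 60 &
```

Like the proof-of-concept script, this only reaches processes that are still direct children of the wrapper when the trap runs.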

If you launch a child process as another user (e.g. via sudo), you may not have permission to kill it even though you are its parent. If you change one of the

cat /dev/random > /dev/null &

lines to

sudo cat /dev/random > /dev/null &

you will not be able to kill that process (assuming you initially launched the script from a regular user account).

Modified script (running one child as root and printing diagnostic info at the end):

#!/bin/bash

curpid="$$"
#launch 2 useless child processes
cat /dev/random > /dev/null &
sudo cat /dev/random > /dev/null &
cpid=`pgrep -P $curpid`  && echo "$(basename $0) pid: $curpid; child pids:" $cpid
#kill the child pids
kill $cpid
sleep 0.5
#check on children
for i in $cpid; do
  echo -n "PID: $i; Orig PPID: $curpid; Cur PPID: "`/usr/bin/ps --ppid $i | grep -Eo '[0-9]{3,}'`
  echo
done

Output with one of the children running as root:

[{0} 02:53:51] $ ./test3.sh 
test3.sh pid: 8144; child pids: 8145 8146
./test3.sh: line 9: kill: (8146) - Operation not permitted
./test3.sh: line 10:  8145 Terminated              cat /dev/random > /dev/null
PID: 8145; Orig PPID: 8144; Cur PPID: 
PID: 8146; Orig PPID: 8144; Cur PPID: 8150

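The child that was launched as root (8146) is the sudo process itself. It is still a child of the script, but because it runs as root the script is not permitted to signal it. The cat that sudo started (8150) is a child of sudo, not of the script, so pgrep -P $$ never lists it. Note that the last column above comes from ps --ppid, which selects the processes whose parent is the given PID, so the 8150 shown there is the child of 8146. The output from ps auxf
shows this clearly as well: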

root      8146  0.0  0.0 244996  7444 pts/2    S    02:54   0:00 sudo cat /dev/random
root      8150  0.0  0.0 113828   744 pts/2    S    02:54   0:00  \_ cat /dev/random

This issue isn't the crux of your question, but it's something to keep in mind.
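
Since pgrep -P only returns direct children, a process such as the cat started by sudo never shows up in the list. A rough sketch of walking the whole tree by calling pgrep -P recursively could look like the following; collect_descendants, DESCENDANTS and job_pid are hypothetical names, and the sh -c line is just a stand-in that produces a child and a grandchild. It does not change the permission issue: kill will still fail for processes owned by another user.

```shell
#!/bin/bash
# Sketch: collect every descendant of a PID (children, grandchildren, ...).
DESCENDANTS=""

collect_descendants() {
  local parent=$1 child
  for child in $(pgrep -P "$parent"); do
    # Recurse first so deeper descendants are listed before their parents.
    collect_descendants "$child"
    DESCENDANTS="$DESCENDANTS $child"
  done
}

# Demo: a child shell that starts its own child (a grandchild of this script).
sh -c 'sleep 30 & wait' &
job_pid=$!
sleep 1   # give the child shell time to start its own child

# Take the snapshot of the tree first, then signal. If a parent were
# killed first, its children would be reparented and no longer be
# found through pgrep -P on the old parent.
collect_descendants "$$"
echo "descendants of $$:$DESCENDANTS"
if [ -n "$DESCENDANTS" ]; then
  kill $DESCENDANTS 2>/dev/null
fi
wait
```

The snapshot is only as good as the moment it is taken: anything whose parent already died before the walk has been reparented and will be missed.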

  • Awesome awesome awesome. One follow-up: if process 1 spawns 2 and 2 spawns 3 and 2 dies (because it's poorly written), does 3 get re-parented to 1 or does it then have a dangling parent? If the latter, I can always work around it in the 90% case by having a watchdog keep track of the lineage tree, but it'd be great if I could just get that information at the end.
    – iAdjunct
    Commented Jun 9, 2016 at 13:53
  • Also, you saved me from getting another tumbleweed badge. Thank you?
    – iAdjunct
    Commented Jun 9, 2016 at 13:53
  • Ok, doesn't work if 2 is killed, so I guess I'll have to use a watchdog script. Thank you!
    – iAdjunct
    Commented Jun 9, 2016 at 15:09
  • If the parent dies, the child pid will get "reparented" to PID 1, so as you've seen this approach isn't sufficient. If you create pid files when spawning the children, the watchdog cleanup might become really simple. See xarg.org/2009/10/write-a-pid-file-in-bash
    – Argonauts
    Commented Jun 9, 2016 at 16:30
  • You can easily identify orphaned processes and their original ppids (maybe stored along with the pid in each child's pid file) and kill anything that now has a ppid of 1. There's more to it corner-case-wise, but it should be painless.
    – Argonauts
    Commented Jun 9, 2016 at 16:34
