I'm currently refactoring a script that works well when executed directly in a terminal but exits early, because of a process check, when executed from crontab. This early termination is caused by code using a ps command piped to several grep/grep -v commands. The intent is to check if a process is already running and, if so, not to execute this script again. I know this code fails because it tries to catch all matching processes but doesn't grep -v out the /bin/sh -c <script name> process that crontab always uses to call scripts initially. While refactoring, it just made sense to use something like pgrep instead of ps piped to several greps.

Here's where my question comes in. My pgrep code works; I just don't fully understand why. When comparing the output of pgrep pgrep_test.sh | grep -v $$ to that of ps -ef | grep pgrep_test.sh, there are additional processes that the pgrep command seems to remove. It seems to me that pgrep is grouping several PIDs together, as if it understands and follows the PID/PPID relationship. The problem is that I don't see anything about that in the pgrep manpage.

I think that to understand why pgrep works in my code, I need to better understand how pgrep groups PIDs/PPIDs. Here's the code I'm using to test this; it runs from crontab:

* * * * * user1 /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log

The code itself:


#!/bin/bash
# Test how grepping for PIDs works when script is called from crontab

echo "+++++$(date +"%b %H:%M:%S") Beginning pgrep script+++++"

pgrep pgrep_test.sh | grep -v "$$" > /dev/null 2>&1
RETCODE=$?
# RC=1 -- No additional processes running
# RC=0 -- Additional processes running
echo "Return code: $RETCODE"
if [ $RETCODE -eq 0 ]; then
    echo "Additional test processes exist, exiting script"
    echo "$(ps -ef | grep pgrep_test.sh)" ; sleep 1
    # Note: the escaped quotes make grep look for a literal "<PID>" string,
    # so the $$ process itself is still listed in this diagnostic output
    echo "$(pgrep -a pgrep_test.sh | grep -v \"$$\")"
    exit 1
else
    echo "No additional processes found, continuing execution"
    echo "$(ps -ef | grep pgrep_test.sh)" ; sleep 1
    echo "$(pgrep -a pgrep_test.sh | grep -v \"$$\")"
    sleep 90
fi

I've used a 90-second sleep in the code to ensure that a cronjob running every minute will exit early every other time. Here's what the logfile looks like, with some additional annotations in the form of comments.

First with no additional processes running:

"+++++Nov 17:04:01 Beginning pgrep script+++++"
# Sleeping every 90s means we should have alternating "no additional
# processes found" and "additional processes found" logs each execution
Return code: 1
No additional processes found, continuing execution

# ps -ef | grep pgrep_test.sh
# Initial /bin/sh -c call crontab executes
user1   12956 12954  0 17:04 ?        00:00:00 /bin/sh -c /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log
# Child process spawned from 12956 (Shouldn't this be PID for $$?)
user1   12957 12956  0 17:04 ?        00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# What even is PID 12961? PPID 12957 is the /bin/bash call but these two commands are identical otherwise
# Can't be the pgrep or sleep as these have not executed yet
# Technically there's more than 1 process here now so pgrep should be giving a return code of 0 and exiting the script
# Why does it work correctly here? How does pgrep know to group these?
user1   12961 12957  0 17:04 ?        00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
user1   12963 12961  0 17:04 ?        00:00:00 grep pgrep_test.sh

# pgrep -a pgrep_test.sh | grep -v $$
12957 /bin/bash /tmp/inferencing/pgrep_test.sh
12965 /bin/bash /tmp/inferencing/pgrep_test.sh

Next with a matching process already running:

"+++++Nov 17:05:01 Beginning pgrep script+++++"
# Since other process is still sleeping, we correctly get a return code of 0 and stop script execution
Return code: 0
Additional test processes exist, exiting script

# crontab process for (now sleeping) original script call
user1   12956 12954  0 17:04 ?        00:00:00 /bin/sh -c /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log
# Sleeping process
user1   12957 12956  0 17:04 ?        00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# New crontab process
user1   13733 13594  0 17:05 ?        00:00:00 /bin/sh -c /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log
# New main bash process
user1   13734 13733  0 17:05 ?        00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# Second main bash process again -- this happens every time
user1   13738 13734  0 17:05 ?        00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# No grep -v grep, this doesn't show up to pgrep anyways
user1   13740 13738  0 17:05 ?        00:00:00 grep pgrep_test.sh

# pgrep -a pgrep_test.sh | grep -v $$
12957 /bin/bash /tmp/inferencing/pgrep_test.sh
13734 /bin/bash /tmp/inferencing/pgrep_test.sh
14105 /bin/bash /tmp/inferencing/pgrep_test.sh

Why is it that when I execute the pgrep command, it seems to know how to filter out the additional child processes associated with $$ when ps -ef piped to greps is not capable?

> This early termination is caused by code using a ps command piped to several grep/grep -v commands. The intent is to check if a process is already running and, if so, not to execute this script again.

Don't do that. It's a horrible method, more so if you don't even bother to check whether your 'grep' matches the input exactly and not just a substring. For example, if you leave vim pgrep_test.sh open, your script will think it's already running.
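
To make the substring problem concrete, here is a minimal sketch that runs grep over a made-up process listing instead of real ps output. The listing, including the vim line and the PIDs in it, is hypothetical and only there for illustration.

```bash
#!/bin/bash
# Hypothetical process listing (user, PID, PPID, command line), for illustration only.
sample_listing='user1 12957 12956 /bin/bash /tmp/inferencing/pgrep_test.sh
user1 20001 19000 vim pgrep_test.sh
user1 20002 19000 grep pgrep_test.sh'

# Substring match: the editor and the grep itself are counted as "instances".
loose_count=$(printf '%s\n' "$sample_listing" | grep -c 'pgrep_test.sh')

# Anchored match on the exact interpreter and script path at the end of the line.
strict_count=$(printf '%s\n' "$sample_listing" |
    grep -c -E ' /bin/bash /tmp/inferencing/pgrep_test\.sh$')

echo "loose: $loose_count, strict: $strict_count"
```

Even the anchored pattern only narrows the match; it still cannot tell the main script process apart from any subshells it has forked, which is one more reason to prefer a lock.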

There are better ways to make a single-instance script:

  • run the script as a systemd .service (either by making your cronjob call 'systemctl start' or by using a systemd .timer to invoke it), as the same service cannot be started twice;

  • or use a lock file through flock, which uses kernel-based exclusive locking to guarantee a single instance.

    * * * * * user1 flock -n /tmp/inferencing/lock /tmp/inferencing/pgrep_test.sh
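
If you would rather keep the crontab entry unchanged, the same lock can be taken from inside the script. The following is only a sketch under assumptions: the lock path and the use of file descriptor 9 are arbitrary choices for illustration, and it relies on the flock utility from util-linux being installed.

```bash
#!/bin/bash
# Hypothetical lock path for illustration; any location writable by the job works.
lockfile="${TMPDIR:-/tmp}/single_instance_demo.lock"

# Open the lock file on file descriptor 9. The descriptor, and any lock held
# on it, stays open until the script exits.
exec 9>"$lockfile"

# flock -n fails immediately instead of waiting if another process holds the lock.
if flock -n 9; then
    lock_status="acquired"
else
    lock_status="busy"
fi
echo "Lock status: $lock_status"

# A real script would 'exit 1' here when lock_status is "busy",
# and otherwise carry on with its actual work.
```

Because the kernel drops the lock when the process exits, even after a crash, there is no stale state to clean up; the lock file itself can simply stay in place.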

> What even is PID 12961? PPID 12957 is the /bin/bash call but these two commands are identical otherwise

It's the "subshell" that handles the command within $( ... ). Every time you use command substitution, bash spawns a child process to handle it. If a simple command is being substituted, that subshell process may directly 'exec' the command in-place (e.g. in the case of $(ps -ef)), but if a whole pipeline is being substituted, that won't necessarily happen.
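
This also bears on why pgrep and ps -ef | grep disagree. By default pgrep matches its pattern against the process name only (its -f option switches to the full command line), whereas ps -ef | grep searches the whole command line. On Linux, a script started through its #! line gets the script's file name as its process name, and forked subshells inherit that name, while cron's /bin/sh -c wrapper is simply named sh. The sketch below is Linux-only and assumes /proc is mounted; it uses bash's $BASHPID variable, which holds the PID of the current shell process, to show that a command-substitution subshell has its own PID but the same process name as its parent.

```bash
#!/bin/bash
# Linux-only sketch: /proc/<pid>/comm holds the process name that pgrep
# matches by default; ps -ef prints the full command line instead.

main_pid=$$
read -r main_name < "/proc/$main_pid/comm"

# Two commands inside $( ) keep the subshell alive as a forked copy of bash.
sub_info=$(read -r name < "/proc/$BASHPID/comm"; echo "$BASHPID $name")
sub_pid=${sub_info%% *}    # text before the first space: the subshell's PID
sub_name=${sub_info#* }    # text after the first space: its process name

echo "main shell: pid=$main_pid name=$main_name"
echo "subshell:   pid=$sub_pid name=$sub_name"
```

Run from a script with a #!/bin/bash line, both names would be the script's file name, which is why pgrep lists the main process and its subshell but neither the sh -c wrapper nor grep.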

While $$ always expands to the PID of the main shell process (i.e. its value is cloned when bash spawns subshells), you can use $BASHPID to get the real process ID of the current interpreter. For example:

$ echo $$, $BASHPID; ps $$ $BASHPID
208231, 208231
 208231 pts/3    Ss     0:00 bash

$ (echo $$, $BASHPID; ps $$ $BASHPID)
208231, 208287
 208231 pts/3    Ss     0:00 bash
 208287 pts/3    S+     0:00 bash

$ { echo $$, $BASHPID; ps $$ $BASHPID; }
208231, 208231
 208231 pts/3    Ss     0:00 bash

$ var=$(echo $$, $BASHPID; ps $$ $BASHPID); echo "$var"
208231, 208294
 208231 pts/3    Ss+    0:00 bash
 208294 pts/3    R+     0:00 ps 208231 208294

The 2nd and 4th examples use subshells (another easy way to detect this is to notice that variables set within a subshell do not get propagated back into the main shell), while the 1st and 3rd don't.
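
The variable-propagation test mentioned above can be sketched as follows; the variable names are arbitrary and chosen only for this illustration.

```bash
#!/bin/bash
marker="main"

# Parentheses fork a subshell: the assignment is lost when the subshell exits.
( marker="subshell" )
after_parens=$marker

# Braces group commands in the current shell: the assignment sticks.
{ marker="group"; }
after_braces=$marker

# Command substitution also runs in a subshell: only its output comes back.
captured=$(marker="substitution"; echo "$marker")
after_substitution=$marker

echo "after ( ):   $after_parens"
echo "after { }:   $after_braces"
echo "after \$( ): $after_substitution (captured: $captured)"
```

Only the brace group changes the variable as seen by the main shell; the two subshell forms leave it untouched.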

