I'm currently refactoring a script that works well when executed directly in a terminal but exits early, because of a process check, when executed from crontab. The early termination is caused by code that pipes a ps command through several grep/grep -v commands. The intent is to check whether the script is already running and, if so, not to execute it again. I know the reason this code fails: it tries to catch all matching processes but doesn't grep -v out the /bin/sh -c <script name> process that crontab always uses to call scripts initially. While refactoring, it made sense to use something like pgrep instead of ps piped to several greps.

Here's where my question comes in. My pgrep code works, but I don't fully understand why. Comparing the output of pgrep pgrep_test.sh | grep -v $$ with that of ps -ef | grep pgrep_test.sh, there are additional processes that the pgrep command seems to remove. It seems to me that pgrep is grouping several PIDs together, as if it understands and follows the PID/PPID relationship. The problem is that I don't see anything about that in the pgrep manpage.

I think that to understand why pgrep works in my code, I need to better understand how pgrep groups PIDs/PPIDs. Here's the code I'm using to test this; it's executed via crontab:

* * * * * user1 /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log

The code itself:

#!/bin/bash

# Test how grepping for PIDs works when script is called from crontab

echo "+++++$(date +"%b %H:%M:%S") Beginning pgrep script+++++"

pgrep pgrep_test.sh | grep -v "$$" > /dev/null 2>&1
# RC=1 -- No additional processes running
# RC=0 -- Additional processes running
RETCODE=$?
echo "Return code: $RETCODE"
if [ $RETCODE -eq 0 ]; then
    echo "Additional test processes exist, exiting script"
    echo "$(ps -ef | grep pgrep_test.sh)" ; sleep 1
    echo "$(pgrep -a pgrep_test.sh | grep -v \"$$\")"
    exit 1
else
    echo "No additional processes found, continuing execution"
    echo "$(ps -ef | grep pgrep_test.sh)" ; sleep 1
    echo "$(pgrep -a pgrep_test.sh | grep -v \"$$\")"
    sleep 90
fi

I've used a 90-second sleep in the code to ensure that a cronjob running every minute will exit early every other time. Here's what the logfile looks like, with some additional annotations in the form of comments.

First with no additional processes running:

"+++++Nov 17:04:01 Beginning pgrep script+++++"
# Sleeping every 90s means we should have alternating "no additional
# processes found" and "additional processes found" logs each execution
Return code: 1
No additional processes found, continuing execution

# ps -ef | grep pgrep_test.sh
# Initial /bin/sh -c call crontab executes
user1   12956 12954  0 17:04 ?        00:00:00 /bin/sh -c /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log
# Child process spawned from 12956 (Shouldn't this be PID for $$?)
user1   12957 12956  0 17:04 ?        00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# What even is PID 12961? PPID 12957 is the /bin/bash call but these two commands are identical otherwise
# Can't be the pgrep or sleep as these have not executed yet
# Technically there's more than 1 process here now so pgrep should be giving a return code of 0 and exiting the script
# Why does it work correctly here? How does pgrep know to group these?
user1   12961 12957  0 17:04 ?        00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
user1   12963 12961  0 17:04 ?        00:00:00 grep pgrep_test.sh

# pgrep -a pgrep_test.sh | grep -v $$
12957 /bin/bash /tmp/inferencing/pgrep_test.sh
12965 /bin/bash /tmp/inferencing/pgrep_test.sh

Next with a matching process already running:

"+++++Nov 17:05:01 Beginning pgrep script+++++"
# Since other process is still sleeping, we correctly get a return code of 0 and stop script execution
Return code: 0
Additional test processes exist, exiting script

# crontab process for (now sleeping) original script call
user1   12956 12954  0 17:04 ?        00:00:00 /bin/sh -c /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log
# Sleeping process
user1   12957 12956  0 17:04 ?        00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# New crontab process
user1   13733 13594  0 17:05 ?        00:00:00 /bin/sh -c /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log
# New main bash process
user1   13734 13733  0 17:05 ?        00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# Second main bash process again -- this happens every time
user1   13738 13734  0 17:05 ?        00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# No grep -v grep, this doesn't show up to pgrep anyways
user1   13740 13738  0 17:05 ?        00:00:00 grep pgrep_test.sh

# pgrep -a pgrep_test.sh | grep -v $$
12957 /bin/bash /tmp/inferencing/pgrep_test.sh
13734 /bin/bash /tmp/inferencing/pgrep_test.sh
14105 /bin/bash /tmp/inferencing/pgrep_test.sh

Why is it that pgrep seems to know how to filter out the additional child processes associated with $$, when ps -ef piped to grep does not?

1 Answer

This early termination is caused by code using a ps command piped to several grep/grep -v commands. The intent is to check if a process is already running and, if so, not to execute this script again.

Don't do that. It's a horrible method, more so if you don't even bother to check whether your 'grep' matches the input exactly and not just a substring. For example, if you leave vim pgrep_test.sh open, your script will think it's already running.

There are better ways to make a single-instance script:

  • run the script as a systemd .service (either by making your cronjob call 'systemctl start' or by using a systemd .timer to invoke it), as the same service cannot be started twice;

    [Service]
    Type=oneshot
    User=user1
    ExecStart=/tmp/inferencing/pgrep_test.sh
    
  • or use a lock file through flock, which uses kernel-based exclusive locking to guarantee a single instance.

    * * * * * user1 flock -n /tmp/inferencing/lock /tmp/inferencing/pgrep_test.sh
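
The cron line above wraps the script from the outside. A common alternative is for the script to take the lock itself, so that it stays single-instance no matter how it is started. The following is a minimal sketch rather than the only way to do it; the lock path /tmp/pgrep_test.lock and the choice of file descriptor 9 are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical lock path; any file writable by the script's user works.
LOCKFILE="${LOCKFILE:-/tmp/pgrep_test.lock}"

acquire_lock() {
    # Open the lock file on file descriptor 9 and keep it open for the
    # lifetime of this shell; the lock is released when the process exits.
    exec 9>"$LOCKFILE" || return 1
    # -n: fail immediately instead of waiting if another process holds the lock.
    flock -n 9
}

if ! acquire_lock; then
    echo "Another instance is already running, exiting" >&2
    exit 1
fi

echo "Lock acquired, continuing execution"
# ... the real work of the script would go here ...
```

Because the lock is tied to the open file descriptor and not to the existence of the file, a crashed or killed script does not leave a stale lock behind.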
    

What even is PID 12961? PPID 12957 is the /bin/bash call but these two commands are identical otherwise

It's the "subshell" that handles the command within $( ... ). Every time you use command substitution, bash spawns a child process to handle it. If a simple command is being substituted, that subshell process may directly 'exec' the command in-place (e.g. in the case of $(ps -ef)), but if a whole pipeline is being substituted, that won't necessarily happen.

While $$ always expands to the PID of the main shell process (i.e. its value is cloned when bash spawns subshells), you can use $BASHPID to get the real process ID of the current interpreter. For example:

$ echo $$, $BASHPID; ps $$ $BASHPID
208231, 208231
    PID TTY      STAT   TIME COMMAND
 208231 pts/3    Ss     0:00 bash

$ (echo $$, $BASHPID; ps $$ $BASHPID)
208231, 208287
    PID TTY      STAT   TIME COMMAND
 208231 pts/3    Ss     0:00 bash
 208287 pts/3    S+     0:00 bash

$ { echo $$, $BASHPID; ps $$ $BASHPID; }
208231, 208231
    PID TTY      STAT   TIME COMMAND
 208231 pts/3    Ss     0:00 bash

$ var=$(echo $$, $BASHPID; ps $$ $BASHPID); echo "$var"
208231, 208294
    PID TTY      STAT   TIME COMMAND
 208231 pts/3    Ss+    0:00 bash
 208294 pts/3    R+     0:00 ps 208231 208294

The 2nd and 4th examples use subshells (another easy way to detect this is to notice that variables set within a subshell do not get propagated back into the main shell), while the 1st and 3rd don't.
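
The variable-propagation check mentioned above can be sketched as follows. The variable names state and ignored are made up for this illustration.

```shell
#!/bin/bash
state="main"

( state="subshell" )            # ( ... ) runs in a forked child process
echo "after ( ... ):  $state"   # prints: after ( ... ):  main

{ state="group"; }              # { ...; } runs in the current shell
echo "after { ...; }: $state"   # prints: after { ...; }: group

ignored=$(state="cmdsub")       # $( ... ) is a subshell as well
echo "after \$( ... ): $state"  # prints: after $( ... ): group
```

Only the assignment inside { ...; } survives, which matches the $$ versus $BASHPID results shown earlier.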
