
On Debian 8.1, I'm using a Bash feature to detect whether the stackoverflow.com website is reachable:

(echo >/dev/tcp/stackoverflow.com/80) &>/dev/null || echo "stackoverflow unreachable"

This is Bash-specific and will not work in sh, the default shell of cron.

If we, on purpose, try the script in sh, we get:

/bin/sh: 1: cannot create /dev/tcp/stackoverflow.com/80: Directory nonexistent
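The difference between the two shells can also be reproduced without touching stackoverflow.com at all (127.0.0.1 port 1 is used here as an illustrative endpoint that is assumed to refuse connections):

```shell
# Under bash, &>/dev/null redirects the subshell's stdout and stderr,
# and the || branch runs when the connection attempt fails:
bash -c '(echo > /dev/tcp/127.0.0.1/1) &>/dev/null || echo "unreachable"'
# prints: unreachable

# Under sh (dash on Debian), the same line parses as '&' plus '>/dev/null':
# the || branch never runs, so nothing appears on stdout at all.
sh -c '(echo > /dev/tcp/127.0.0.1/1) &>/dev/null || echo "unreachable"' 2>/dev/null
```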

Hence, if I put only the following into my personal crontab via crontab -e (without setting SHELL to /bin/bash), I expect the script to be executed once per minute, and therefore I also expect to receive the above error by mail once per minute:

* * * * * (echo >/dev/tcp/stackoverflow.com/80) &>/dev/null || echo "stackoverflow unreachable"

And indeed, exactly as expected, we see from /var/log/syslog that the entry is executed once per minute:

# sudo grep stackoverflow /var/log/syslog
Aug 24 18:58:01 localhost CRON[13719]: (mat) CMD ((echo >/dev/tcp/stackoverflow.com/80) &>/dev/null || echo "stackoverflow unreachable")
Aug 24 18:59:01 localhost CRON[13723]: (mat) CMD ((echo >/dev/tcp/stackoverflow.com/80) &>/dev/null || echo "stackoverflow unreachable")
Aug 24 19:00:01 localhost CRON[13727]: (mat) CMD ((echo >/dev/tcp/stackoverflow.com/80) &>/dev/null || echo "stackoverflow unreachable")
...

During the last ~2 hours, this was executed more than 120 times already, as I can verify by piping the output to wc -l.

However, of these >120 executions of the shell command (which, to repeat, is invalid for /bin/sh), only three produced an e-mail:

The first one at 19:10:01, the second at 20:15:01, and the third at 20:57:01.

All three mails read exactly as expected and contain precisely the error message produced by running the script in an incompatible shell (on purpose). For example, the second mail I received reads (the other two are virtually identical):

From [email protected]  Mon Aug 24 20:15:01 2015
From: [email protected] (Cron Daemon)
To: [email protected]
Subject: Cron  (echo >/dev/tcp/stackoverflow.com/80)&>/dev/null || echo "stackoverflow unreachable"
...

/bin/sh: 1: cannot create /dev/tcp/stackoverflow.com/80: Directory nonexistent

From /var/log/mail.log, I see that these three mails were the only mails sent and received in the last hours.

So where are the more than 100 additional mails we would expect from cron, given the output produced by the erroneous script?

To summarize:

  1. Mail is configured correctly on this system; I can send and receive mail without problems via /usr/bin/sendmail.
  2. Cron is set up correctly: it picks up the task as expected and executes it precisely at the configured times. I have tried many other tasks and scheduling options, and cron executed them all exactly as expected.
  3. The script always writes output (see below), so I expect cron to mail that output to me for each invocation.
  4. The output is mailed to me only occasionally, and is apparently discarded in most cases.

There are many ways to work around the obvious mistake that led to the above observations:

  1. I can set SHELL=/bin/bash in my crontab.
  2. I can create a heartbeat.sh with #!/bin/bash, and invoke that.
  3. I can invoke the script with /bin/bash -c ... within crontab.
  4. etc., all fixing the mistake of using a Bash-specific feature within sh.

However, none of this addresses the core issue of this question, which is that in this case cron does not reliably send mail even though the script always produces output.

I have verified that the script always creates output by creating wrong.sh (which again on purpose uses the unsuitable /bin/sh shell, to produce the same error that cron should see):

#!/bin/sh
(echo >/dev/tcp/stackoverflow.com/80) &>/dev/null || echo "stackoverflow unreachable"

Now I can invoke the script in a loop and see if there ever is a case where it finishes without creating output. Using Bash:

$ while true; do [[ -n $(./wrong.sh 2>&1) ]]; echo $?; done | grep -v 0

Even in thousands of invocations, I could not reproduce a case where the script finishes without creating output.

What may be the cause of this unpredictable behaviour? Can anyone reproduce it? To me it looks like there may be a race condition in which cron can miss a script's output, perhaps especially when the error originates from the shell itself. Thank you!

  • any evidence in your mail spooler's log that it makes it that far?
    – Jeff Schaller
    Commented Aug 24, 2015 at 20:13
  • Thank you very much for looking into this! When I do mailq or sudo mailq, I get: Mail queue is empty, so the mails are not in the spool either.
    – mat
    Commented Aug 24, 2015 at 21:16
  • (I have checked the source of the cron daemon to verify that the installed and working postfix infrastructure is used for sending mails in my case.)
    – mat
    Commented Aug 24, 2015 at 21:28
  • One additional point: When I change the line in crontab to * * * * * ~/wrong.sh (contents of ~/wrong.sh shown above), then I reliably get an email, showing the error, about every second minute. This is still not what I expect from the daemon, but it is a huge improvement over the in-line invocation of the command, in terms of reliability, and may help to narrow down the cause of this.
    – mat
    Commented Aug 24, 2015 at 21:40
  • I am not familiar enough with postfix, but I know sendmail has configuration parameters to drop emails if the load average is too high; so: does your system often have a high load average? If so, does postfix have a similar config parameter? Also, I'm aware-but-not-familiar with the latest Debian using systemd; are there new or custom startup parameters there? Also, are the missing emails going to the root account instead of your own?
    – Jeff Schaller
    Commented Aug 25, 2015 at 1:03

3 Answers


Upon further testing, I suspect the & is interfering with your results. As you point out, &>/dev/null is Bash syntax, not sh syntax. As a result, sh creates a subshell and backgrounds it. The subshell's echo does produce output on stderr, but my theory is that:

  1. cron is not catching the subshell's stderr, and
  2. the backgrounding of the subshell always completes successfully, thus bypassing your || echo ....

... causing the cron job to have no output and thus no mail. From my reading of the vixie-cron source, the job's stdout and stderr should be captured by cron, so the output must be getting lost in the backgrounded subshell.

Test it yourself in a /bin/sh environment (assuming you do not have a file named 'bar' here):

(grep foo bar) &
echo $?
0
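Spelling the theory out: under sh, the original crontab line is effectively parsed as two independent commands (this re-grouping is my reading of the POSIX shell grammar):

```shell
# Command 1: the subshell, detached by '&'. Its "cannot create" error goes
# to the job's stderr, possibly after cron has stopped listening:
(echo > /dev/tcp/stackoverflow.com/80) &

# Command 2: a redirection with no command word. It exits with status 0,
# so the || branch on the right-hand side never runs:
> /dev/null || echo "stackoverflow unreachable"
```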
  • I think the answer is close to what you said, but it is actually a race between the subshell and cron. Since the subshell is backgrounded, cron stops listening to it promptly. But occasionally it manages to squeeze out its error message very quickly, while cron is still listening.
    – Celada
    Commented Aug 25, 2015 at 2:52
  • Many thanks for looking into this, and noting the role that & plays!
    – mat
    Commented Aug 26, 2015 at 7:56

I could reproduce the phenomenon on Ubuntu 15.04 with the following crontab:

* * * * * { echo job 0; } & sleep 5
* * * * * { echo job 1; } &
* * * * * { sleep 5; echo job 2; } &

I got mails from cron with job 0 every minute, mails with job 1 occasionally (5-6 times in the last 10 minutes), and no mails with job 2.

So it seems cron waits for the child process to exit, then sends a mail with all the stdout/stderr output it can slurp at that moment. Delayed output from the orphaned grandchild process is simply discarded.
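The race can also be emulated without cron. The sketch below (file names illustrative) waits only for the direct child, as cron does, and then inspects whatever output has arrived by that moment:

```shell
#!/bin/sh
# Collect the job's output in a temporary file.
tmp=$(mktemp)

# The direct child (sh) backgrounds the group and exits immediately;
# the grandchild keeps running for another ~2 seconds.
sh -c '{ sleep 2; echo "job 2"; } &' > "$tmp"

# This is the moment cron would compose the mail: the grandchild has not
# written anything yet, so the "mail body" is empty.
echo "at child exit: [$(cat "$tmp")]"

sleep 3   # give the orphaned grandchild time to finish
echo "after grandchild: [$(cat "$tmp")]"   # the delayed line is here now
rm -f "$tmp"
```

On an unloaded system the first line shows empty brackets and the second shows job 2 — exactly the output cron discards.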

  • Many thanks for looking into this, and especially for these great test cases! Just one additional question: Is it possible to emulate, via Bash commands, the situation that cron is in? That is: Is there a way to simulate the scenario that we start a subshell, and output from the grandchild is sometimes seen and sometimes not, nondeterministically depending on how the race condition plays out?
    – mat
    Commented Aug 26, 2015 at 7:53

In addition to the comments above, I wonder whether a slightly simpler explanation might be possible.

Recall that any shell expands/processes the command line before executing anything. As a result, when sh processes the line, the '&' terminates the command and puts it into the background (by default at a lower priority), and stdin/stdout/stderr are not assigned as the redirections intended. So maybe the "race condition" lies in how fast the shell (sh) processes the line, which will clearly depend on the load on your system (and on how it subsequently interacts with cron).
