
My question is simple - is there a way to display curl's individual Exit Status for each URL when curl is doing multiple requests?

Let's imagine that I need to check sites a.com, b.com, c.com and see their:

  • HTTP return code
  • if the HTTP return code is 000, I need to display curl's exit code.

NOTE - a.com, b.com, c.com are used as an example in this code/question. In the real script, I do have a list of valid URLs - more than 400 of them with non-overlapping patterns - and they return a variety of HTTP codes - 200/4xx/5xx as well as 000.

The 000 is the case when curl could not complete the request, but it provides Exit Codes to understand what prevented it from establishing a connection. In my case, a number of exit codes come up as well - 6, 7, 35, 60.
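
For a single URL both values are easy to capture together; a minimal sketch (the URL here is only a placeholder):

# -w prints the HTTP code; $? then holds curl's exit code for this one request
curl -s --location -o /dev/null -w "%{response_code}\n" https://a.com
echo "$?"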

I tried to run the following code

unset a
unset rep
a=($(curl -s --location -o /dev/null -w "%{response_code}\n" {https://a.com,https://b.com,https://a.com}))
rep+=("$?")
printf '%s\n' "${a[@]}"
echo
printf '%s\n' "${rep[@]}"

While the above code returns the HTTP return code for each individual request, the Exit Code is captured only for the last request.

000
000
000

60

I do need the ability to log individual Exit Code when I supply multiple URLs to curl. Is there a workaround/solution for this problem?

Some additional information: currently I put all my URLs in an array and loop through it, checking each URL separately (see the sketch below). However, going through 400 URLs takes 1-2 hours and I need to somehow speed up the process. I did try to use -Z with curl. While it did speed up the process by about 40-50%, it didn't help: in addition to showing only the above-mentioned last Exit Status, the Exit Status in this case is always displayed as 0, which is not correct.
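
The sequential loop looks roughly like this (a sketch; assume the array is named urls):

for url in "${urls[@]}"; do
   code="$(curl -s --location -o /dev/null -w "%{response_code}" "$url")"
   rc=$?                                   # curl's exit code for this single URL
   printf '%s %s %s\n' "$code" "$rc" "$url"
done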

P.S. I am open to using any other command-line tool if it can resolve the above problem - parallel checking of 10s/100s of URLs with logging of their HTTP codes and, if the connection can't be established, logging additional information the way curl's Exit Codes do.

Thanks.

  • Out of curiosity: what is the point of braces in {https://a.com,https://b.com,https://a.com}? I think Bash expands this string to https://a.com https://b.com https://a.com, which is even a little simpler; so why not KISS by using the latter form directly? If you used https://{a,b,a}.com, then I would understand, because this form is shorter and DRY. For now I think the braces only obfuscate the code. What am I missing? Commented Oct 14, 2020 at 23:10
  • Fair question. Let me explain. As it is documented in curl's howto, the purpose of the braces is to apply the supplied curl parameters to every site in the braces. Without braces, the alternative is to use the switch --next and then repeat all of curl's parameters for each site. You are correct that https://{a,b,a}.com would be stylistically better for this particular example; however, as I mentioned in the description, a/b/c are used for illustrative purposes. In my real code with 400 URLs, they are all different and have no common patterns. Commented Oct 15, 2020 at 5:08
  • OK, so it's deeper. The braces in the howto are quoted. Your question is tagged bash. Bash expands unquoted braces. In your code curl doesn't see the braces; it sees what the shell gives to it. Now, I'm not saying it doesn't work for you in this particular case. I'm saying it's Bash that handles the syntax when you think it's curl. This misunderstanding may lead to subtle bugs. Commented Oct 15, 2020 at 5:16
  • I understand my concern is irrelevant to the issue apparently; illustrative purposes, I get it. Still, if you ever want to use braces with curl for serious purposes, keep in mind the shell may handle them first in some circumstances; this may affect the overall result. Commented Oct 15, 2020 at 6:10
  • Understood. When I enclose the brace expressions in double quotes, the output/result is the same in my case. URL arguments in curl examples on that site are always enclosed in double quotes; however, it does not really affect the outcome. Commented Oct 15, 2020 at 7:27

1 Answer


Analysis

The exit code is named "exit code" because it is returned when a command exits. If you run just one curl then it will exit exactly once.

curl, when given more than one URL, might provide a way to retrieve a code equivalent to the exit code of a separate curl handling just the current URL; it would be something similar to the %{response_code} you used. Unfortunately it seems there is no such functionality (yet; maybe someone will add it). To get N exit codes you need N curl processes. You need to run something like this N times:

curl … ; echo "$?"

I understand your N is about 400; you tried this in a loop and it took hours. Well, spawning 400 curls (even with 400 echos, if echo weren't a builtin; and even with 400 (sub)shells, if needed) is not that time consuming. The culprit is the fact that you run all of these synchronously (didn't you?).


Simple loop and its problems

It's possible to loop and run the snippet asynchronously:

for url in … ; do
   ( curl … ; echo "$?" ) &
done

There are several problems with this simple approach though (a toy demonstration follows the list):

  1. You cannot easily limit the number of curls that run simultaneously, there is no queue. This can be very bad in terms of performance and available resources.
  2. Concurrent output from two or more commands (e.g. from two or more curls) may get interleaved, possibly mid-line.
  3. Even if output from each command separately looks fine, curl or echo from another subshell may cut in between curl and its corresponding echo.
  4. There is no guarantee a subshell invoked earlier starts (or ends) printing before a subshell invoked later.
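
You can observe problems (2)-(4) without curl; a toy sketch:

for job in A B; do
   ( for i in 1 2 3; do echo "job $job line $i"; done ) &
done
wait
# Lines from jobs A and B may come out in any interleaved order;
# nothing guarantees the job started first prints (or finishes) first.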

parallel

The right tool is parallel. The basic variant of the tool (from moreutils, at least in Debian) solves (1). It probably solves (2) in some circumstances. This is irrelevant anyway, because this variant solves neither (3) nor (4).

GNU parallel solves all these problems.

  • It solves (1) by design.

  • It solves (2) and (3) with its --group option:

    --group
    Group output. Output from each job is grouped together and is only printed when the command is finished. Stdout (standard output) first followed by stderr (standard error). […]

    (source)

    which is the default, so usually you don't have to use it explicitly.

  • It solves (4) with its --keep-order option:

    --keep-order
    -k
    Keep sequence of output same as the order of input. Normally the output of a job will be printed as soon as the job completes. […] -k only affects the order in which the output is printed - not the order in which jobs are run.

    (source)

In Debian, GNU parallel is in a package named parallel. The rest of this answer uses GNU parallel.
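
On Debian and its derivatives it can be installed, for example, with:

sudo apt install parallel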


Basic solution

<urls parallel -j 40 -k 'curl -s --location -o /dev/null -w "%{response_code}\n" {}; echo "$?"'

where urls is a file with URLs and -j 40 means we allow up to 40 parallel jobs (adjust it to your needs and abilities). In this case it's safe to embed {} in the shell code. It's an exception explicitly mentioned in this answer: Never embed {} in the shell code!

The output will be like

404
0
200
0
000
7
…

Note the single-quoted string is the shell code. Within it you can implement some logic, e.g. so exit code 0 is never printed (a sketch of that variant follows below). If I were you I would print it anyway, in the same line, on the leading position:

<urls parallel -j 40 -k '
   out="$(
      curl -s --location -o /dev/null -w "%{response_code}" {}
   )"
   printf "%s %s\n" "$?" "$out"'

Now even if some curl is manually killed before it prints, you will get something in the first column. This is useful for parsing (we'll return to it). Example:

0 404
0 200
7 000
…
143 
…

where 143 means curl was terminated (see Default exit code when process is terminated).
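
If you do prefer to hide the exit code when it is 0 (the logic mentioned earlier), a sketch of that variant:

<urls parallel -j 40 -k '
   out="$(
      curl -s --location -o /dev/null -w "%{response_code}" {}
   )"
   rc=$?
   # print the exit code only when curl actually failed
   if [ "$rc" -eq 0 ]; then
      printf "%s\n" "$out"
   else
      printf "%s %s\n" "$rc" "$out"
   fi'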


With arrays

If your URLs are in an array named urls, avoid this syntax:

parallel … ::: "${urls[@]}"    # don't

parallel is an external command. If the array is large enough then you will hit argument list too long. Use this instead:

printf '%s\n' "${urls[@]}" | parallel …

It will work because in Bash printf is a builtin and therefore everything before | is handled internally by Bash.

To get from urls array to a and rep arrays, proceed like this:

unset a
unset rep
while read -r repx ax; do
   rep+=("$repx")
   a+=("$ax")
done < <(printf '%s\n' "${urls[@]}" \
         | parallel -j 40 -k '
              out="$(
                 curl -s --location -o /dev/null -w "%{response_code}" {}
              )"
         printf "%s %s\n" "$?" "$out"')
printf '%s\n' "${a[@]}"
echo
printf '%s\n' "${rep[@]}"

Notes

  • If we generated exit codes in the second column (which is easier, as you don't need a helper variable like out) and adjusted our read accordingly, so it's read -r ax repx, then a line like <empty ax><space>143 would save 143 into ax, because read ignores leading whitespace (it's complicated). By reversing the order we avoid this bug in our code. A line like 143<space><empty ax> is properly handled by read -r repx ax.
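
    A quick way to see this read behavior (a toy sketch):

    read -r ax repx <<< " 143"       # empty first column, 143 in the second
    printf "ax=%s repx=%s\n" "$ax" "$repx"
    # prints: ax=143 repx=   (leading whitespace is eaten, 143 lands in ax)

    read -r repx ax <<< "143 "       # reversed columns: 143 comes first
    printf "repx=%s ax=%s\n" "$repx" "$ax"
    # prints: repx=143 ax=   (handled as intended)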

  • You will hopefully be able to check 400 URLs in a few minutes. The duration depends on how many jobs you allow in parallel (parallel -j …), but also on:

    • how fast the servers respond;
    • how much data and how fast curls download;
    • options like --connect-timeout and --max-time (consider using them; a sketch follows this list).
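
For example, the basic solution with both timeouts added (the values are illustrative; adjust them):

<urls parallel -j 40 -k 'curl -s --location --connect-timeout 5 --max-time 30 -o /dev/null -w "%{response_code}\n" {}; echo "$?"'

A transfer that hits --max-time fails with exit code 28, so such URLs will show up in the log like any other failure.
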
  • Interesting fact: the author and maintainer of GNU parallel is Ole Tange, our fellow user here on Super User (although more active on Unix & Linux SE). I think it's great that developers provide support here (on Stack Exchange in general). Check if you can find the main author of curl. Commented Oct 15, 2020 at 15:14
  • Thank you very much. I did read about the parallel a couple of years ago but didn't think about using it. Commented Oct 16, 2020 at 19:13
  • I did implement your solution - processing time of 400 URLs with parallel -j 20 went down from 40 minutes to 5. Thanks again. My question is - is there a way, in your example, to get the individual string passed to parallel in the form of a variable? Commented Dec 5, 2020 at 9:23
  • @Invisible999 A common, fixed string? An environment variable is a generic way to make some string available to any command (the command may or may not be designed to use it, though). Furthermore, "passed to parallel" and "passed to the code run by parallel" are not equivalent. See man 1 parallel, the option named --env. If this is not what you want then you need to be more descriptive. Probably not in a comment. Ask a new question maybe. Commented Dec 5, 2020 at 10:14
  • @Invisible999 Oh, now I see this question. Sorry, I interpreted "get an individual string passed …" similarly to "get something done". It seems you want to "get a string that was passed". English is a foreign language to me. Commented Dec 5, 2020 at 10:21
