
I have 1,000,000 different requests (by "different" I mean each has a different query parameter; they are plain GET requests with no payload, and request and response sizes are on the order of KBs only, no images or complicated stuff) in a text file where each line is curl followed by the actual URL. Each line could also just be the HTTP URL, and then I can pipe the response to jq and, if a condition is met, write to a log file (this is what I'm trying to achieve).

I plan to start with 1000, then 5000, and work up to 10000 reqs/second. We would prefer to sustain approximately 10000 reqs/sec over a long period of time (say 48-72 hours).

Which is the best approach?

  1. Use GNU parallel on a text file, where each text file will have 10000 prepared HTTP URLs? (In the text file, is curl <url> better than just the plain URL?)
  2. While doing the above, is each curl request run in its own shell? How do I change this to a single shell that can send requests and receive responses at 10000 requests/sec?

I spent 2.5 days going back and forth between xargs and GNU parallel, but some answers on Stack Overflow recommend GNU parallel over xargs and vice versa. Also, is there a better tool (say a Python tool, or ab) to send 10000 reqs/sec and, for each response, write to a log file if a condition is met? Please help. Thanks.

PS: Can Go help parallelize this work better than GNU parallel?

  • It sounds like a performance-testing problem. I would use a dedicated tool like JMeter or Gatling; both of them can be scripted and are the de facto standard for this type of testing. Writing your own custom solution sounds like a nightmare to maintain. Commented Jan 5, 2022 at 8:15
  • +1 to what Alex said above. Please also see web server benchmarking - Wikipedia. Now that you mentioned ApacheBench, I think it's a good start. Commented Jan 5, 2022 at 8:37
  • At 10,000 reqs/second (your target), 1,000,000 requests will only take 100 seconds, i.e. less than 2 minutes. But you say that you expect to run at that speed for 48-72 hours? That doesn't really match up. Commented Jan 5, 2022 at 10:14
  • You really need to have an agreement with the owner and hoster of the webserver before sending it 10,000 reqs/second. Given that you have that, I would probably look at ApacheBench (or parallel if I had to do it manually). Commented Jan 5, 2022 at 10:16

3 Answers


tl;dr

You're aiming for Gb/s speeds, at rates where you can't even spawn processes fast enough, let alone start shells. A shell script therefore won't be your solution. Use libcurl from a programming language that gives you access to the libcurl multi interface.

Full answer

Which is the best approach?

  1. Use GNU parallel on a text file, where each text file will have 10000 prepared HTTP URLs? (In the text file, is curl <url> better than just the plain URL?)
  2. While doing the above, is each curl request run in its own shell? How do I change this to a single shell that can send requests and receive responses at 10000 requests/sec?

You want to do 10,000 requests a second, each "in the order of KBs" (plural); at that rate, anything beyond about 12.5 KB per response already saturates a single 1 Gb/s ethernet link. Be sure you understand the architectural implications of that!

First of all, your transmission medium (some ethernet?) is serial. Sending multiple packets in parallel is not possible, at the deepest technical level. What is possible is to fill the queue of packets to be sent from multiple cores, and then handle the incoming replies in a parallel fashion.

Then: spawning a whole curl process just to do a single request is super inefficient. But spawning a whole shell (I guess you were thinking about bash?) is even worse! Spawning a process means forking a new process, replacing it with the image from the executable, loading all the libraries the executable depends on, and parsing the configuration / startup scripts, before finally doing the real work; and that real work, in your case, is easier than all the rest. My quick test loop in C says you can't do more than about 3500 vfork() calls per second, each followed by checking the PID and exec'ing an empty program, on an 8-thread 3.6 GHz CPU. And you want to exec something very heavyweight. You want to spawn three times as many processes per second as my machine can spawn at all, and still do networking and processing inside those processes. Not gonna happen, by a large margin.
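For reference, here is a minimal reconstruction in C++ of that kind of timing loop (not the exact test quoted above; /bin/true stands in for the "empty program", and the numbers will vary a lot between machines):

// Rough measurement of how many vfork()+exec() cycles per second one core manages.
#include <chrono>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    const int n = 10000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        pid_t pid = vfork();
        if (pid == 0) {                      // child: exec something trivial right away
            execl("/bin/true", "true", (char *)nullptr);
            _exit(127);                      // only reached if exec fails
        }
        int status;
        waitpid(pid, &status, 0);            // parent: reap the child
    }
    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("%.0f vfork+exec cycles per second\n", n / secs);
}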

Spawning a jq process for every request is outright disastrous. jq has to do two things:

  1. Parse the query you've passed it as argument,
  2. Parse the JSON according to that query.

Now, the JSON is different every time, but it is relatively straightforward to parse once you know the query (as long as said query is not too complex). Parsing the query, however, is essentially compiling a program, and you are redoing the same compilation over and over even though the query never changes.

So, for high-performance testing like this, going through a shell and parallel/xargs does not work: the systematic inefficiencies simply won't let you reach the speeds you need.

So, instead, write a program. The programming language doesn't matter too much, as long as it allows for proper multithreading (maybe avoid PHP, Delphi and, um, Visual Basic 6.0) and has access to a reasonably fast JSON parser. Python would work, although Python is not known for good multithreading; it might still be enough. Personally, I'd simply write this in C++.
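To illustrate the "fast JSON parser" part: the check you currently delegate to jq becomes a small in-process function. A minimal sketch using nlohmann/json (the library choice and the "status" field are assumptions on my part; any fast parser and any condition will do):

#include <nlohmann/json.hpp>
#include <string>

// Returns true if this response should be written to the log file.
// Roughly the in-process equivalent of piping to `jq 'select(.status != "ok")'`.
bool should_log(const std::string &body) {
    auto doc = nlohmann::json::parse(body, nullptr, /*allow_exceptions=*/false);
    if (doc.is_discarded() || !doc.is_object())
        return false;                        // malformed or unexpected JSON
    return doc.value("status", "") != "ok";  // the "query" is compiled into the binary once
}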


Recommendation:

If you know your JSON parser well enough and would rather avoid having to deal with epoll details yourself: libcurl has a nice API for jobs like this, the libcurl multi interface. Run one such multi loop on every CPU core, and you'll probably be saturating your connection to the server.
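A minimal single-threaded sketch of such a multi-interface loop (the 200-transfer window, reading URLs from stdin, and the check_and_log() placeholder are my assumptions; error handling is omitted; curl_multi_poll needs libcurl >= 7.66, older versions can use curl_multi_wait):

#include <curl/curl.h>
#include <cstdio>
#include <deque>
#include <iostream>
#include <string>

// Append received body data into a std::string.
static size_t collect(char *data, size_t size, size_t nmemb, void *userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

// Placeholder for "parse the JSON, and if the condition is met, write to a log file".
static void check_and_log(const char *url, const std::string &body) {
    if (body.find("\"status\"") != std::string::npos)      // dummy condition
        std::fprintf(stderr, "match: %s\n", url ? url : "?");
}

int main() {
    std::deque<std::string> urls;                          // one URL per line on stdin
    for (std::string line; std::getline(std::cin, line);)
        if (!line.empty()) urls.push_back(line);

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURLM *multi = curl_multi_init();
    const int window = 200;                                // transfers kept in flight
    int running = 0;

    auto add_one = [&] {
        if (urls.empty()) return;
        CURL *easy = curl_easy_init();
        auto *body = new std::string;                      // freed when the transfer completes
        curl_easy_setopt(easy, CURLOPT_URL, urls.front().c_str());  // libcurl copies the string
        curl_easy_setopt(easy, CURLOPT_WRITEFUNCTION, collect);
        curl_easy_setopt(easy, CURLOPT_WRITEDATA, body);
        curl_easy_setopt(easy, CURLOPT_PRIVATE, body);
        curl_multi_add_handle(multi, easy);
        urls.pop_front();
    };
    for (int i = 0; i < window; ++i) add_one();

    do {
        curl_multi_perform(multi, &running);
        int numfds = 0;
        curl_multi_poll(multi, nullptr, 0, 1000, &numfds); // sleep until something happens

        int msgs_left = 0;
        while (CURLMsg *m = curl_multi_info_read(multi, &msgs_left)) {
            if (m->msg != CURLMSG_DONE) continue;
            CURL *easy = m->easy_handle;
            char *url = nullptr, *priv = nullptr;
            curl_easy_getinfo(easy, CURLINFO_EFFECTIVE_URL, &url);
            curl_easy_getinfo(easy, CURLINFO_PRIVATE, &priv);
            auto *body = reinterpret_cast<std::string *>(priv);
            check_and_log(url, *body);                     // your condition + log file
            delete body;
            curl_multi_remove_handle(multi, easy);
            curl_easy_cleanup(easy);
            add_one();                                     // keep the window full
        }
    } while (running > 0 || !urls.empty());

    curl_multi_cleanup(multi);
    curl_global_cleanup();
}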


If you want to write an application that globally maximizes CPU utilization, at the cost of complexity, you'd have separate transmit and receive workers, since sending the request is probably a lot easier than handling the received data. In that design, you would first:

  • initialize your multi-thread-safe logging system (I like spdlog, but seeing 10000 potential logs per second, something that aggregates and writes binary data instead of human-readable text files might be much, much more appropriate),
  • set up your parser,
  • spawn a bunch (wild guess: as many as you have CPU threads /3 - 1) of transmit workers (TX),
  • spawn a bunch (wild guess: CPU threads·2/3 - 1) of receive workers (RX),
  • spawn a thread that holds an operating system notification token for TCP sockets ready to read data (on Linux, that mechanism would be epoll, which is available in Python through the select module as select.epoll),
  • spawn a thread that prepares the requests, establishes the TCP connections, and then assigns them to the workers' incoming queues.

In each TX worker,

  • you make the prepared request (which might just be a curl_ function call - libcurl is actually a nice library, not just a command line tool)

In each RX worker,

  • You take the data you've just gotten and parse it, calculate your result and if it fits, tell your logging system to log,

In the epoll thread,

  • Handle the events by reading the data and handing it off fairly (e.g. in a round-robin way) to the RX workers.
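A rough sketch of that epoll thread follows. The FdQueue type and the assumption that each socket was registered with EPOLLIN | EPOLLONESHOT (so a socket is never handed to two workers at once) are mine, not part of the design above:

#include <sys/epoll.h>
#include <atomic>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

// Minimal thread-safe queue of sockets; each RX worker pops from its own queue.
struct FdQueue {
    std::deque<int> q;
    std::mutex m;
    std::condition_variable cv;
    void push(int fd) {
        { std::lock_guard<std::mutex> l(m); q.push_back(fd); }
        cv.notify_one();
    }
    int pop() {                                   // RX workers call this in their loop
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !q.empty(); });
        int fd = q.front(); q.pop_front();
        return fd;
    }
};

// The epoll thread: wait for readable sockets and hand them out round-robin.
// Assumes sockets were added with epoll_ctl(epfd, EPOLL_CTL_ADD, fd, ...) using
// EPOLLIN | EPOLLONESHOT, and that the RX worker re-arms them (EPOLL_CTL_MOD)
// after reading, so no fd is delivered twice concurrently.
void epoll_dispatch(int epfd, std::vector<FdQueue *> &rx_queues, std::atomic<bool> &stop) {
    std::vector<epoll_event> events(1024);
    size_t next = 0;                                          // round-robin cursor
    while (!stop) {
        int n = epoll_wait(epfd, events.data(), (int)events.size(), 100 /*ms*/);
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            rx_queues[next]->push(fd);                        // RX worker reads + parses it
            next = (next + 1) % rx_queues.size();
        }
    }
}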

If you want an example of using epoll with libcurl, curl has an example (and it's very close to your use case, actually!).

If you want a discussion on how to deal with multi-threads, curl and network in C++, this Stack Overflow answer might be for you.


@MarcusMiller sums up very well why you should not expect to be able to do 10000 requests/sec from a single machine. But maybe you are not asking that. Maybe you have 100 machines that can be used for this task. And then it becomes feasible.

On each of the 100 machines run something like:

ulimit -n 1000000
shuf huge-file-with-urls |
  parallel -j1000 --pipe --roundrobin 'wget -i - -O - | jq .'

If you need to run the urls more than once, you simply add more shuf huge-file-with-urls. Or:

forever shuf huge-file-with-urls | [...]

(https://gitlab.com/ole.tange/tangetools/-/tree/master/forever)

GNU Parallel can run more than 1000 jobs in parallel, but it is unlikely your computer can process that many. Even just starting 1000000 sleeps (see: https://unix.stackexchange.com/a/150686/366317) pushed my 64-core machine to the limit. Starting 1000000 simultaneous processes that actually did something would have made the server stall completely.

A process will migrate from one core to another, so you cannot tell which core is running what. You can limit which core a process should run on with taskset, but that only makes sense in very specialized scenarios. Normally you simply want your task to be moved to a CPU core that is idle, and the kernel does a good job of that.

  • Thanks for answering. But here you mentioned running 2000 jobs in parallel, and in fact 20000 jobs in parallel: unix.stackexchange.com/a/150686/366317. Also, how will the command know not to use the same line twice?
    – sofs1
    Commented Jan 7, 2022 at 21:34
  • Is it possible to find how many jobs are running in each core at a given moment using some command?
    – sofs1
    Commented Jan 7, 2022 at 22:11
  • As I and Ole said, what you want is not feasible. Commented Jan 7, 2022 at 22:40
  • @sofs1 See edit.
    – Ole Tange
    Commented Jan 7, 2022 at 23:15
  • Man. In the last few days I learned so much about CPUs. I wish I could spend more time. Thanks for the edit. One other question: in this example gnu.org/software/parallel/… of running 250 jobs in parallel, does the value for -N mean: create 50 jobs and for each job run a further 50 jobs? I did man parallel and -N means "Use at most max-args arguments per command line." I couldn't wrap my head around it. Could you help me understand or break down cat myinput | parallel --pipe -N 50 --roundrobin -j50 parallel -j50 your_prg
    – sofs1
    Commented Jan 7, 2022 at 23:47

@MarcusMiller sums up very well why you should not expect to be able to do 10000 requests/sec from a single machine.

I think I have to revise that.

I just installed varnish and ab on my 64 core machine and then I ran:

$ seq 64 | parallel -N0 ab -c 100 -n 100000 http://lo:6081/
Server Software:        Apache/2.4.41 (really Varnish)
Server Hostname:        lo
Server Port:            6081

Document Path:          /
Document Length:        10918 bytes

Concurrency Level:      100
Time taken for tests:   132.916 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      1127168742 bytes
HTML transferred:       1091800000 bytes
Requests per second:    752.35 [#/sec] (mean)
Time per request:       132.916 [ms] (mean)
Time per request:       1.329 [ms] (mean, across all concurrent requests)
Transfer rate:          8281.53 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   94 484.9      1   15453
Processing:     0   38  68.9     28    4672
Waiting:        0   37  68.9     27    4672
Total:          1  132 498.4     29   15683

I get 64 × ~720 requests/second ≈ 45000 requests and replies per second.

Each reply is ~10 Kbytes, so that is around 4.5 Gbit/s of traffic.

Around 50% of the CPU is varnish. The rest is ab.

I think you can do 10000 requests/sec at 1 Gbit/s on a 16-core machine. But then you cannot do any processing of the result, and I do not see a way to tell ab to use different requests.

If you use https://github.com/philipgloyne/apachebench-for-multi-url you can even use different urls:

parallel --pipepart --block -1 --fifo -a urls ./ab -L {} -c 100 -n 100000
