3

Sample code:

#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <pthread.h>

/* Build (binary name "writetest" as used below): gcc -O2 -pthread writetest.c -o writetest */
int
main (int argc, char **argv)
{
  unsigned char buffer[128];   /* 128-byte payload; its contents are irrelevant here */
  char buf[0x4000];

  /* Fully buffer stdout with a 16 KiB buffer. */
  setvbuf (stdout, buf, _IOFBF, 0x4000);

  /* Two forks: 4 processes in total, all writing to the same stdout. */
  fork ();
  fork ();

  /* Ask for the maximum real-time round-robin priority. */
  pthread_t this_thread = pthread_self ();
  struct sched_param params;
  params.sched_priority = sched_get_priority_max (SCHED_RR);
  pthread_setschedparam (this_thread, SCHED_RR, &params);

  /* Write the buffer forever. */
  while (1)
    {
      fwrite (buffer, 128, 1, stdout);
    }
}

This program spawns 4 threads and outputs on stdout the contents of buffer, which is 128 bytes, or 16 long ints on a 64-bit CPU.

If I then run:

./writetest | pv -ptebaSs 800G >/dev/null

I get a speed of about 7.5 GB/s.

Incidentally, that is the same speed I get if I do:

$ mkfifo out
$ dd if=/dev/zero bs=16384 >out &
$ dd if=/dev/zero bs=16384 >out &
$ dd if=/dev/zero bs=16384 >out &
$ dd if=/dev/zero bs=16384 >out &
$ pv <out -ptebaSs 800G >/dev/null

Is there any way to make this faster? Note: the buffer in the real program is not filled with zeroes.

My curiosity is to understand how much data a single program (multithreaded or multiprocess) can output.

It looks like 4 people didn't understand this simple question. I even put the reason for the question in bold.

  • 1
    BTW fork doesn't create a thread. Not my DV/CV though. Commented Mar 28, 2019 at 11:08
  • @Jabberwocky I could use pthreads, but there is no speed improvement; the bottleneck seems to be the pipe.
    – Zibri
    Commented Mar 28, 2019 at 11:32
  • My curiosity is to understand how much data a single program (multithreaded or multiprocess) can output.
    – Zibri
    Commented Mar 28, 2019 at 11:34
  • 1
    You should make that clear in the question. Commented Mar 28, 2019 at 11:35
  • Did you already try write(STDOUT_FILENO, ...)? And are functions like sendfile() and vmsplice() suitable?
    – SKi
    Commented Mar 28, 2019 at 11:36

3 Answers

5

Well, it seems that the Linux scheduler and I/O priorities played a big role in the slowdown.

Also, Spectre and other CPU vulnerability mitigations came into play.

After further optimization, to achieve a faster speed I had to tune these things:

1) program nice level (nice -n -20)
2) program ionice level (ionice -c 1 -n 7)
3) pipe size increased 8 times (one way to do this with fcntl() is sketched after the scheduler settings below)
4) disable CPU mitigations by adding "pti=off spectre_v2=off l1tf=off" to the kernel command line
5) tuning the Linux scheduler:

echo -n -1 >/proc/sys/kernel/sched_rt_runtime_us
echo -n -1 >/proc/sys/kernel/sched_rt_period_us
echo -n -1 >/proc/sys/kernel/sched_rr_timeslice_ms
echo -n 0 >/proc/sys/kernel/sched_tunable_scaling
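The answer doesn't show how the pipe was enlarged; here is a minimal sketch of one way to do it from the writer's side, assuming Linux and its fcntl(2) F_GETPIPE_SZ / F_SETPIPE_SZ extension. The 64 KiB default and the 8x request are assumptions about a typical setup; the kernel may round the size up or refuse anything above /proc/sys/fs/pipe-max-size.

#define _GNU_SOURCE          /* for F_GETPIPE_SZ / F_SETPIPE_SZ */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main (void)
{
  /* Only meaningful when stdout actually is a pipe, e.g. ./a.out | cat. */
  int old = fcntl (STDOUT_FILENO, F_GETPIPE_SZ);
  if (old < 0)
    perror ("F_GETPIPE_SZ");

  /* The default Linux pipe capacity is usually 64 KiB; ask for 8 times that. */
  if (fcntl (STDOUT_FILENO, F_SETPIPE_SZ, 8 * 64 * 1024) < 0)
    perror ("F_SETPIPE_SZ");

  fprintf (stderr, "pipe capacity: %d -> %d bytes\n",
           old, fcntl (STDOUT_FILENO, F_GETPIPE_SZ));
  return 0;
}

Note that the comments under the next answer report mixed results from resizing, so it is worth measuring on your own machine.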

Now the program outputs (on the same PC) 8.00 GB/s!

If you have other ideas you're welcome to contribute.

1

First you need to determine your rate-limiting factor. It could be the CPU/memory speed, the CPU/system-call latency, the pipe implementation, or the stdio implementation. There are probably more, but that is a good start:

  1. cpu/memory -- test how fast you can memcpy a bunch of zeroes.

  2. cpu/syscall -- test, by writing 1 byte to /dev/null, how long a simple write takes on your system (see the sketch after this list).

  3. pipe implementation -- you sort of have this, but you could try varying the pipe capacity (fcntl(2) F_GETPIPE_SZ / F_SETPIPE_SZ, if you are on Linux).

  4. stdio implementation -- replace fwrite/setvbuf with write. Aligning your write size with the pipe capacity / number of processes might yield a good result, but you should probably investigate more broadly.
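As an illustration of point 2, a rough sketch (not from the original answer) that times 1-byte writes to /dev/null with clock_gettime; the iteration count is arbitrary:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main (void)
{
  int fd = open ("/dev/null", O_WRONLY);
  if (fd < 0)
    {
      perror ("open");
      return 1;
    }

  enum { ITERS = 1000000 };   /* arbitrary; raise it for a steadier average */
  char byte = 0;
  struct timespec t0, t1;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < ITERS; i++)
    write (fd, &byte, 1);     /* 1-byte write: the cost is dominated by the syscall itself */
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf ("%.1f ns per write(2)\n", ns / ITERS);

  close (fd);
  return 0;
}

Comparing the result with and without the mitigations mentioned in the other answers should show how much of the per-write cost is pure syscall overhead.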

Try all of the above with multiple processes, although you might need to scale up the memcpy tests to get meaningful results.

With these numbers, you should be able to calculate your maximum throughput. Please report back; I am sure more than a few people are interested.

  • hmm.. I tried all that... strangely, write + buffer makes it slower than the code above. About the fcntl, I didn't try it and I think it could improve things, but I have never used it... if you want to run some tests, you could post your modification of my code if it's faster...
    – Zibri
    Commented Mar 28, 2019 at 16:19
  • I just tried with F_SETPIPE_SZ after opening a fifo... the speed is slower by 30%... but you are free to try... maybe I did something wrong.
    – Zibri
    Commented Mar 28, 2019 at 16:57
  • Update: doubling the pipe size improves the speed, but raising it further does not seem to help.
    – Zibri
    Commented Mar 30, 2019 at 13:05
-1

What your program does is:

  1. Calls fwrite. That merely copies data from buffer into the stdio buffer buf.
  2. Once buf fills up, it calls write.

To speed it up, avoid the copy in step 1: skip fwrite and use the write syscall directly. E.g.:

char buf[0x4000];
for(;;)
    write(STDOUT_FILENO, buf, sizeof buf); // Implement error handling.

You may also want to make buf bigger to minimize the number of syscalls (Spectre mitigations made syscalls more expensive).
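A self-contained sketch along these lines, for illustration only; the 4 MB figure echoes the suggestion in the comments below and is just a starting point to tune:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main (void)
{
  /* A large buffer amortizes the per-syscall cost; tune the size for your machine. */
  enum { BUFSZ = 4 * 1024 * 1024 };
  char *buf = malloc (BUFSZ);
  if (!buf)
    return 1;

  for (;;)
    {
      /* The contents are irrelevant for a throughput test; partial writes are harmless here. */
      ssize_t n = write (STDOUT_FILENO, buf, BUFSZ);
      if (n < 0)
        {
          perror ("write");   /* stop on any write error */
          break;
        }
    }

  free (buf);
  return 0;
}

Pipe it through pv as in the question and vary BUFSZ to find the sweet spot.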

  • @Zibri It may be slow only if you write a small array. Write a bigger array, e.g. 4MB. You need to play with the array size to maximize your throughput. Commented Mar 28, 2019 at 16:19
  • I tried... and since I get the same speed as with the original program I'd rather keep it as it is... as I said the bottleneck is in the pipe 90%...
    – Zibri
    Commented Mar 28, 2019 at 17:45
  • 1
    @Zibri You are right that it is I/O bound, but that unnecessary copying doesn't help. Commented Mar 28, 2019 at 18:41
