13

In a coding competition, specified at this link, there is a task where you need to read a lot of data from stdin, do some calculations, and present a lot of data on stdout.

In my benchmarking, almost all of the time goes to I/O, even though I have tried to optimize it as much as possible.

The input is a string (1 <= len <= 100'000) followed by q rows of pairs of ints, where 1 <= q <= 100'000.

I benchmarked my code on a 100 times larger dataset (len = 10M, q = 10M) and this is the result:

 Activity            time (s)  accumulated (s)

 Read text:          0.004     0.004
 Read numbers:       0.146     0.150
 Parse numbers:      0.200     0.350
 Calc answers:       0.001     0.351
 Format output:      0.037     0.388
 Print output:       0.143     0.531

By implementing my own formatting and number parsing inline, I managed to get the time down to a third of the time with printf and scanf.

However, when I uploaded my solution to the competition's webpage, it took 1.88 seconds (I think that is the total time over 22 datasets). When I look at the high-score list, there are several implementations (in C++) that finished in 0.05 seconds, nearly 40 times faster than mine! How is that possible?

I guess that I could speed it up a bit by using 2 threads, so I can start calculating and writing to stdout while still reading from stdin. However, in a theoretical best case that only saves min(0.150, 0.143) = 0.143 seconds on my large dataset. I'm still nowhere close to the high score.

[Figure: statistics of the consumed time]

The program gets compiled by the website with these options:

gcc -g -O2 -std=gnu99 -static my_file.c -lm

and timed like this:

time ./a.out < sample.in > sample.out

My code looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN (100000 + 1)
#define ROW_LEN (6 + 1)
#define DOUBLE_ROW_LEN (2*ROW_LEN)

int main(int argc, char *argv[])
{
    int ret = 1;

    // Set custom buffers for stdin and out
    char stdout_buf[16384];
    setvbuf(stdout, stdout_buf, _IOFBF, 16384);
    char stdin_buf[16384];
    setvbuf(stdin, stdin_buf, _IOFBF, 16384);

    // Read stdin to buffer
    char *buf = malloc(MAX_LEN);
    if (!buf) {
        printf("Failed to allocate buffer");
        return 1;
    }
    if (!fgets(buf, MAX_LEN, stdin))
        goto EXIT_A;

    // Get the number of tests
    int m;
    if (scanf("%d\n", &m) != 1)
        goto EXIT_A;

    char *num_buf = malloc(DOUBLE_ROW_LEN);
    if (!num_buf) {
        printf("Failed to allocate num_buffer");
        goto EXIT_A;
    }

    int *nn;
    int *start = calloc(m, sizeof(int));
    int *stop = calloc(m, sizeof(int));
    int *staptr = start; 
    int *stpptr = stop;
    char *cptr;
    for(int i=0; i<m; i++) {
        fgets(num_buf, DOUBLE_ROW_LEN, stdin);
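        // Accumulate digits into *nn; the space between the two numbers
        // switches nn from the current start[] slot to the stop[] slot.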
        nn = staptr++;
        cptr = num_buf-1;
        while(*(++cptr) > '\n') {
            if (*cptr == ' ')
                nn = stpptr++;
            else
                *nn = *nn*10 + *cptr-'0';
        }
    }


    // Count for each test
    char *buf_end = strchr(buf, '\0');
    int len, shift;
    char outbuf[ROW_LEN];
    char *ptr_l, *ptr_r, *out;
    for(int i=0; i<m; i++) {
        ptr_l = buf + start[i];
        ptr_r = buf + stop[i];
        while(ptr_r < buf_end && *ptr_l == *ptr_r) {
            ++ptr_l;
            ++ptr_r;
        }

        // Print length of same sequence
        shift = len = (int)(ptr_l - (buf + start[i]));
        out = outbuf;
        do {
            out++;
            shift /= 10;
        } while (shift);
        *out = '\0';
        do {
            *(--out) = "0123456789"[len%10];
            len /= 10;
        } while(len);
        puts(outbuf);
    }



    ret = 0;

    free(start);
    free(stop);
EXIT_A:
    free(buf);
    return ret;
}
  • Why are you allocating memory for individual ints? What system are you on? On Linux, stdio is faster (and faster than iostreams on Windows); on Windows, iostream outperforms stdio. stdio can be made somewhat faster by using the unlocked variants of the IO functions (puts_unlocked instead of puts, etc.), as POSIX requires stdio to use recursive locks for the calls, while no such requirement exists for iostream (AFAIK). Commented Apr 23, 2017 at 15:50
  • Looks like you are doing output each time through the loop. What about if you traded memory for speed, allocated a larger buffer, and then printed the entire output all at once? (A sketch of this, combined with the unlocked-I/O idea above, follows these comments.) Or, if there's too much output for that to be feasible, you can still consolidate the outputs substantially via buffering. This would solve the problem if puts is actually your bottleneck. I'm not sure how you're measuring to arrive at those times. What all operations are included in, say, the "Print output" measurement? Commented Apr 23, 2017 at 16:01
  • Minor: cptr = num_buf-1; is undefined behavior - although it likely "works" as desired. Commented Apr 24, 2017 at 3:47
  • the problem states the maximum size for each array, so eliminate the calls to malloc() and calloc() and just declare the arrays big enough for the max amount of data Commented Aug 21, 2017 at 23:37
  • Strongly suggest not bothering to make the calls to setvbuf(). Commented Aug 21, 2017 at 23:38
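Following the first two comments above, here is a minimal sketch (the names and buffer size are mine, not from the question) that accumulates all answers in one buffer and flushes them with a single call, so stdio's per-call locking is paid once. POSIX's putc_unlocked/getc_unlocked are the unlocked variants mentioned above if you keep per-item I/O:

#include <stdio.h>

// All answers accumulate here and are flushed once at the end.
// 4 MB is an arbitrary, generous bound for this problem's output.
static char out_buf[1 << 22];
static size_t out_len = 0;

static void emit_uint(unsigned v) {
    char tmp[12];
    int n = 0;
    do { tmp[n++] = (char)('0' + v % 10); v /= 10; } while (v);
    while (n) out_buf[out_len++] = tmp[--n];
    out_buf[out_len++] = '\n';
}

// At the end of main, one locked stdio call instead of one per answer:
//     fwrite(out_buf, 1, out_len, stdout);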

3 Answers

2

Thanks to your question, I went and solved the problem myself. Your time is better than mine, but I'm still using some stdio functions.

I simply do not think the high score of 0.05 seconds is bona fide. I suspect it's the product of a highly automated system that returned that result in error, and that no one ever verified it.

How to defend that assertion? There's no real algorithmic complexity: the problem is O(n). The "trick" is to write specialized parsers for each aspect of the input (and avoid work done only in debug mode). The total time for 22 trials is 50 milliseconds, meaning each trial averages about 2.3 ms? We're down near the threshold of measurability.

Competitions like the problem you addressed yourself to are unfortunate, in a way. They reinforce the naive idea that performance is the ultimate measure of a program (there's no score for clarity). Worse, they encourage going around things like scanf "for performance" while, in real life, getting a program to run correctly and fast basically never entails avoiding or even tuning stdio. In a complex system, performance comes from things like avoiding I/O, passing over the data only once, and minimizing copies. Using the DBMS effectively is often key (as it were), but such things never show up in programming challenges.

Parsing and formatting numbers as text does take time, and in rare circumstances can be a bottleneck. But the answer is hardly ever to rewrite the parser. Rather, the answer is to parse the text into a convenient binary form, and use that. In short: compilation.

That said, a few observations may help.

You don't need dynamic memory for this problem, and it's not helping. The problem statement says the input array may be up to 100,000 elements, and the number of trials may be as many as 100,000. Each trial is two integer strings of up to 6 digits each, separated by a space and terminated by a newline: 6 + 1 + 6 + 1 = 14. Total input, maximum, is 100,000 + 1 + 6 + 1 + 100,000 * 14 bytes: about 1.5 MB. You are allowed 1 GB of memory.

I just allocated a single 1.6 MB buffer, and read it all in at once with read(2). Then I made a single pass over that input.
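That part might look something like the sketch below (my own names, not the answer author's code; the buffer bound follows from the arithmetic above, rounded up):

#include <unistd.h>

#define IN_MAX (1 << 21)            /* 2 MB, comfortably above the ~1.5 MB bound */
static char in_buf[IN_MAX + 1];

/* Read all of stdin into in_buf in one loop; returns total bytes read. */
static size_t slurp(void) {
    size_t total = 0;
    ssize_t n;
    while ((n = read(STDIN_FILENO, in_buf + total, IN_MAX - total)) > 0)
        total += (size_t)n;
    in_buf[total] = '\0';
    return total;
}

/* A specialized parser: consume one unsigned integer and its separator. */
static unsigned parse_uint(const char **p) {
    const char *s = *p;
    unsigned v = 0;
    while (*s >= '0' && *s <= '9')
        v = v * 10 + (unsigned)(*s++ - '0');
    *p = *s ? s + 1 : s;            /* step past the space or newline */
    return v;
}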

You got suggestions to use asynchronous I/O and threads. The problem statement says you're measured on CPU time, so neither of those helps. The shortest distance between two points is a straight line; a single read into statically allocated memory wastes no motion.

One ridiculous aspect of the way they measure performance is that they compile with gcc -g and never define NDEBUG. That means assert(3) is invoked in code that is measured for performance! I couldn't get under 4 seconds on test 22 until I removed my asserts.
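To illustrate (a hypothetical checked accessor, not code from the problem): assert is compiled out only when NDEBUG is defined, which the site's command line never does.

#include <assert.h>

/* Built with `gcc -g -O2 ...` (no -DNDEBUG), this check executes on every
 * call and is counted in the measured time; -DNDEBUG removes it entirely. */
int checked_get(const int *a, int n, int i) {
    assert(i >= 0 && i < n);
    return a[i];
}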

In sum, you did pretty well, and I suspect the winner you're baffled by is a phantom. Your code does faff about a bit, and you can dispense with dynamic memory and tuning stdio. I bet your time can be trimmed by simplifying it. To the extent that performance matters, that's where I'd direct your attention.

0

You should allocate all your buffers contiguously: allocate one buffer the size of all your buffers combined (num_buf, start, stop), then set the pointers to the corresponding offsets. This can reduce your cache misses / page faults.
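A minimal sketch of that idea, reusing the question's names (the int arrays go first so alignment is preserved, and calloc keeps start/stop zeroed, which the question's parser relies on; error checking omitted):

/* One zero-initialized allocation holding start[], stop[] and num_buf
 * back to back, so the parsing loop touches one contiguous region. */
int  *start   = calloc(1, 2 * (size_t)m * sizeof(int) + DOUBLE_ROW_LEN);
int  *stop    = start + m;
char *num_buf = (char *)(stop + m);   /* char has no alignment requirement */
/* ... the parsing loop is unchanged ... */
free(start);                          /* one free releases all three */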

Since the read and write operations seem to consume a lot of time, you should consider adding threads: one thread should deal with I/O and another with the computation. (It is worth checking whether another thread for the printing could speed things up as well.) Make sure you don't use any locks while doing this.
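One lock-free way to arrange that is a single-producer/single-consumer handoff via C11 atomics. This is my own sketch of the idea (names, sizes, and the atomics design are mine, not the answer's): the main thread appends finished answers to out_buf and publishes the byte count, while a writer thread trails behind it.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

static char out_buf[1 << 22];            /* holds the whole output */
static atomic_size_t ready = 0;          /* bytes completed by the producer */
static atomic_bool producer_done = false;

static void *writer(void *arg) {
    (void)arg;
    size_t written = 0;
    for (;;) {
        size_t avail = atomic_load_explicit(&ready, memory_order_acquire);
        if (avail > written) {
            ssize_t n = write(STDOUT_FILENO, out_buf + written, avail - written);
            if (n > 0) written += (size_t)n;
        } else if (atomic_load_explicit(&producer_done, memory_order_acquire)) {
            return NULL;                 /* everything flushed */
        }                                /* else: busy-wait; fine for a sketch */
    }
}

/* Producer side, after appending each answer at offset `ready`:
 *     atomic_store_explicit(&ready, new_len, memory_order_release);
 * and when all answers are done:
 *     atomic_store_explicit(&producer_done, true, memory_order_release);
 *     pthread_join(writer_tid, NULL);
 */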

0

Answering this question is tricky because optimization heavily depends on the problem you have. One idea is to look at the content of the file you are trying to read and see if there are patterns or things that you can use in your favor. The code you wrote is a "general" solution for reading from a file, executing something, and then writing to a file. But if the file is not randomly generated each time and the content is always the same, why not try to write a solution for that file?

On the other hand, you could try to use low-level system functions. One that comes to mind is mmap, which allows you to map a file directly into memory and access that memory instead of using scanf and fgets.
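A sketch of that route (it assumes stdin is redirected from a regular file, as in the timing command above; a pipe would need a read() fallback):

#include <sys/mman.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;
    if (fstat(0, &st) != 0)
        return 1;
    /* Map stdin read-only and parse straight out of the mapping,
     * with no intermediate fgets()/scanf() copies. */
    char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, 0, 0);
    if (data == MAP_FAILED)
        return 1;
    /* ... parse data[0 .. st.st_size-1] in a single pass ... */
    munmap(data, (size_t)st.st_size);
    return 0;
}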

Another thing I found that might help: in your solution you have two main loops, so why not try to use only one? Another thing would be to do some asynchronous I/O reading: instead of reading the whole file in one loop and then doing the calculation in another loop, you could read a portion at the beginning, start processing it asynchronously, and continue reading. This link might help for the async part.
