
I have attempted to implement the Sieve of Eratosthenes in C. It works rather quickly for values tested up to ten million: time ./soe.out 10000000 reports real: 0m0.218s (this is, of course, not including the result printing at the end, which takes the majority of the time).

Is my code the most efficient it could be, and would implementing a segmented sieve instead be a good idea for improvement?

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <errno.h>
#include <stdlib.h>

#define true (1)
#define false (0)
typedef unsigned char bool;

#define MAX 10000000

/* to_int: converts a character array to an integer, returns -1 on error, -2 on out of limits */
int to_int(char* inp) {
    int len = strlen(inp);
    unsigned int out = 0, prev_out = 0, mult = 1;

    if (len == 0)
        return -1;

    for (int i = len - 1; i >= 0; i--) {
        if (inp[i] < 48 || inp[i] > 57)
            return -1;

        prev_out = out;
        out += (inp[i] - 48) * mult;
        mult *= 10;

        /* detect wrapping */
        if (out < prev_out)
            return -2;
    }

    return out;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "Syntax: number to find primes up to (max. %d)\n", MAX);
        return EXIT_FAILURE;
    }

    unsigned int max = to_int(argv[1]);

    if (max == -1) {
        fprintf(stderr, "Syntax: number to find primes up to (you typed %s, which is not a valid number)\n", argv[1]);
        return EXIT_FAILURE;
    }

    if (max > MAX || max == -2) {
        printf("Warning: the number you typed was outside of the limit. It has been set to %d.\nPress the return key to continue...\n", MAX);
        max = MAX;
        getchar();
    }

    max++;

    /* Set up the list */
    bool *list = NULL;
    if ((list = malloc(max)) == NULL) {
        fprintf(stderr, "Error! Could not allocate the requested amount of memory: %s\nExiting...\n", strerror(errno));
        return EXIT_FAILURE;
    }

    memset(list, true, max);

    int iteration = 0;
    int max_sqrt = sqrt(max);

    for (int i = 2; i <= max_sqrt; i++) {
        if (list[i])
            for (int j = i*i; j <= max; j += i)
                list[j] = false;
    }

    for (int i = 1; i < max; i++)
        if (list[i])
            printf("%d: prime\n", i);

    free(list);

    return EXIT_SUCCESS;
}
  • Why do you start with $j = i*i$? Shouldn't it be $j = i + i$?
    – Our, Nov 13, 2017 at 10:18
  • @onurcanbektas: No, i*i is correct - all smaller multiples of i have already been eliminated. Consider eliminating multiples of 5: there is no point eliminating 10 (even), 15 (multiple of three), or 20 (even).
    Commented Nov 13, 2017 at 10:22

5 Answers


As user1118321 notes, there's no need to write your own string-to-integer conversion. Just use one of the standard library functions provided for this purpose, such as atoi() (simple) or strtol() / strtoul() (provides better error handling). This also lets you get rid of the klugy use of -1 / -2 as error indicators.

Also, as suggested by Peter Cordes while I was writing this, don't define your own bool type and true/false constants. Just #include <stdbool.h> instead.


Your main loop uses what, to me, is a funny brace style:

for (int i = 2; i <= max_sqrt; i++) {
    if (list[i])
        for (int j = i*i; j <= max; j += i)
            list[j] = false;
}

For consistency, I'd expect either all of the for and if statements to have braces, or (if you really love braceless loops) none of them. Personally, I prefer to always use braces unless the entire loop / conditional is on one single line. That is, I'd prefer to write this as:

for (int i = 2; i <= max_sqrt; i++) {
    if (list[i]) {
        for (int j = i*i; j <= max; j += i) {
            list[j] = false;
        }
    }
}

If I really wanted to minimize the line count for some reason, I might squeeze the inner loop onto one line and replace the if block with an if (...) continue statement, like this:

for (int i = 2; i <= max_sqrt; i++) {
    if (!list[i]) continue;
    for (int j = i*i; j <= max; j += i) list[j] = false;
}

In this particular case, though, I'd prefer the longer version. The one-line style is fine for trivial loops or conditionals, where the whole thing can be understood at a glance. But your inner loop, with its unusual stepping and starting point, is complex enough that it IMO benefits from a bit more verbosity.


You could make your code twice as fast by treating 2 as a special case and skipping all even numbers, like this:

for (int i = 3; i <= max_sqrt; i += 2) {
    if (list[i]) {
        for (int j = i*i; j <= max; j += 2*i) {
            list[j] = false;
        }
    }
}

Note that now this code doesn't touch the even-numbered elements of list at all, so you can also reduce your memory use by not including them in your sieve at all!

To avoid cluttering your code with index adjustments, and to reduce the risk of off-by-one errors, it may be useful to define a macro (or an inline function) that takes care of the indexing, e.g. like this:

/* test whether the odd number idx is in the sieve */
static inline bool is_in_sieve(bool *sieve, int idx) {
    return sieve[idx / 2];
}
/* remove the odd number idx from the sieve */
static inline void remove_from_sieve(bool *sieve, int idx) {
    sieve[idx / 2] = false;
}

which you could then use in your code like this:

for (int i = 3; i <= max_sqrt; i += 2) {
    if (is_in_sieve(list, i)) {
        for (int j = i*i; j <= max; j += 2*i) {
            remove_from_sieve(list, j);
        }
    }
}

In principle, you could save a further 33% of your runtime and memory use by also treating 3 as a special case and leaving all multiples of 3 out of your sieve, and so on, but this process (which naturally leads to wheel factorization) starts hitting diminishing returns pretty quickly.

A more practically useful trick would be to get rid of the wasted space in your sieve array by using only a single bit for each number. This will reduce your memory use by a factor of (at least) 8, at the cost of a small runtime increase.

For example, you could make your sieve a char array of (max + 15) / 16 elements, initialize them all with 0xFF, and rewrite the inline helper functions as:

/* test whether the odd number idx is in the sieve */
static inline bool is_in_sieve(char *sieve, int idx) {
    return sieve[idx / 16] & (1 << (idx % 16 / 2));
}
/* remove the odd number idx from the sieve */
static inline void remove_from_sieve(char *sieve, int idx) {
    sieve[idx / 16] &= ~(1 << (idx % 16 / 2));
}

The actual sieving loop doesn't need to be changed in any way, since all the array accesses are already encapsulated by these two helper functions!
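For the allocation and initialization of this variant, something along these lines should do (a minimal sketch, reusing the error-handling style of the original; sieve_bytes is just an illustrative name):

size_t sieve_bytes = ((size_t)max + 15) / 16;   /* one bit per odd number */
char *sieve = malloc(sieve_bytes);
if (sieve == NULL) {
    fprintf(stderr, "Error! Could not allocate %zu bytes for the sieve\n", sieve_bytes);
    return EXIT_FAILURE;
}
memset(sieve, 0xFF, sieve_bytes);               /* every odd number starts out "in the sieve" */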

Also, it might be simpler and more efficient to just let the sieve array be initialized to all zeros (which we can do easily with calloc()) and set the bits that correspond to numbers not in the sieve. Again, we can do this simply by changing the helper functions:

/* test whether the odd number idx is in the sieve */
static inline bool is_in_sieve(char *sieve, int idx) {
    return !( sieve[idx / 16] & (1 << (idx % 16 / 2)) );
}
/* remove the odd number idx from the sieve */
static inline void remove_from_sieve(char *sieve, int idx) {
    sieve[idx / 16] |= (1 << (idx % 16 / 2));
}

I'd be tempted to remove your arbitrary hard-coded MAX limit, since it's just that — arbitrary. If your computer can handle larger sieves, why not let it?

On the other hand, removing the upper limit does mean that we would need to check for a bunch of problems that could arise if max was too large:

  • If the type of max was wider than size_t, the argument to malloc() might overflow.
  • If max was too large to be exactly representable as a double, the value of sqrt(max) might be slightly off.
  • If max > INT_MAX - 2*sqrt_max, then j += 2*i might overflow. If j is a signed integer, this is undefined behavior; if we make j unsigned, it "merely" leads to an infinite loop.

Out of those, the last limit is probably the most restrictive, at least as long as max is of type int. Off the top of my head, I'm not sure if size_t is guaranteed not to be narrower than int, but it would be a very perverse system indeed where that wasn't the case. And since an IEEE double can store up to 53-bit integers exactly, and since (int)sqrt(x) only has about half as many bits as x, we're unlikely to see any floating-point accuracy issues unless max is greater than 2^100 or so. While that's technically possible e.g. on a DSP with 128-bit ints, there are no computers with enough memory for a sieve that big. And even if there were, you'd be long dead before the CPU could even finish counting from 1 up to 2^100.

So as long as we check that max <= INT_MAX - 2*sqrt(INT_MAX), we should probably be fine. If you really felt paranoid, you could also check that max / 16 < (size_t)-1 and that max == (int)(double)max, but that's almost certainly overkill unless you're using this code to control a nuclear reactor or something. And I say that as someone who tends to be very heavily into defensive coding.


So, with all those changes (and a bunch of other minor tweaks), here's how I'd rewrite your code:

#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
#include <limits.h>
#include <math.h>
#include <errno.h>

/* test whether the odd number idx is in the sieve */
static inline bool is_in_sieve(char *sieve, int idx) {
    return !( sieve[idx / 16] & (1 << (idx % 16 / 2)) );
}
/* remove the odd number idx from the sieve */
static inline void remove_from_sieve(char *sieve, int idx) {
    sieve[idx / 16] |= (1 << (idx % 16 / 2));
}

int main(int argc, char **argv) {
    /* it's technically possible that argc == 0 */
    char *program_name = (argc > 0 ? argv[0] : "sieve");

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <max>\n"
                "\tmax: number to find primes up to\n",
                program_name);
        return EXIT_FAILURE;
    }

    /* carefully parse the input number */
    errno = 0;
    char *tail;
    long max_long = strtol(argv[1], &tail, 10);
    if (errno || *tail != '\0' || max_long < 0 || max_long > INT_MAX) {
        fprintf(stderr, "%s: invalid maximum \"%s\"\n", program_name, argv[1]);
        return EXIT_FAILURE;
    }

    /* if we were lazy and careless, we'd just do max = atoi(argv[1]) */
    int max = max_long;

    /* there are no primes smaller than 2 */
    if (max < 2) return EXIT_SUCCESS;

    /* make sure that j += 2*i and the sieve allocation cannot overflow */
    if (max > INT_MAX - 2*(int)sqrt(INT_MAX) || max / 16 >= (size_t)-1) {
        fprintf(stderr, "%s: maximum %d too large\n", program_name, max);
        return EXIT_FAILURE;
    }

    /* allocate the sieve array of (max+1)/16 bytes, rounded up */
    char *sieve = calloc(max / 16 + 1, 1);
    if (!sieve) {
        fprintf(stderr, "%s: failed to allocate %d element sieve\n", program_name, max);
        return EXIT_FAILURE;
    }

    int max_sqrt = sqrt(max);

    for (int i = 3; i <= max_sqrt; i += 2) {
        if (is_in_sieve(sieve, i)) {
            for (int j = i*i; j <= max; j += 2*i) {
                remove_from_sieve(sieve, j);
            }
        }
    }

    puts("2");  /* "2 is the oddest prime, because it's even" */

    for (int i = 3; i <= max; i += 2) {
        if (is_in_sieve(sieve, i)) printf("%d\n", i);
    }

    free(sieve);

    return EXIT_SUCCESS;
}

Try it online!

On my old laptop, this takes about 1 minute 20 seconds and about 122 MiB of memory to generate all primes up to 2,000,000,000 (and to print them to /dev/null). Out of that 122 MiB, the sieve array takes up a little over 119 MiB, with the rest presumably being miscellaneous overhead.

If you want to go even higher, then at some point switching to a segmented sieve could indeed be a good idea.
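A segmented sieve keeps only a small window of the range in memory at a time, so the working set stays in cache no matter how large the upper bound gets. A rough sketch of the idea (illustrative only: the function name, SEGMENT_SIZE value, and omission of allocation error checking are mine, and it assumes max stays comfortably below UINT64_MAX and 2^53 so that the sqrt() call is exact enough):

#include <inttypes.h>
#include <math.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SEGMENT_SIZE ((uint64_t)1 << 18)   /* window size; tune so it fits in cache */

static void segmented_sieve(uint64_t max) {
    /* 1. Find the "base" primes up to sqrt(max) with a plain sieve. */
    uint64_t limit = (uint64_t)sqrt((double)max);
    bool *small = malloc(limit + 1);
    uint64_t *primes = malloc((limit + 1) * sizeof *primes);
    size_t nprimes = 0;

    memset(small, true, limit + 1);
    for (uint64_t i = 2; i <= limit; i++) {
        if (!small[i])
            continue;
        primes[nprimes++] = i;
        for (uint64_t j = i * i; j <= limit; j += i)
            small[j] = false;
    }

    /* 2. Sieve one cache-sized window [low, high] at a time. */
    bool *segment = malloc(SEGMENT_SIZE);
    for (uint64_t low = 2; low <= max; low += SEGMENT_SIZE) {
        uint64_t high = low + SEGMENT_SIZE - 1;
        if (high > max)
            high = max;
        memset(segment, true, high - low + 1);

        for (size_t k = 0; k < nprimes && primes[k] * primes[k] <= high; k++) {
            uint64_t p = primes[k];
            uint64_t start = (low + p - 1) / p * p;   /* first multiple of p in the window */
            if (start < p * p)
                start = p * p;                        /* smaller multiples were handled by smaller primes */
            for (uint64_t j = start; j <= high; j += p)
                segment[j - low] = false;
        }

        for (uint64_t i = low; i <= high; i++)
            if (segment[i - low])
                printf("%" PRIu64 "\n", i);
    }

    free(segment);
    free(primes);
    free(small);
}

int main(void) {
    segmented_sieve(100);   /* example: print all primes up to 100 */
    return EXIT_SUCCESS;
}

Each window reuses the same small array of base primes, which is also what makes it easy to sieve an arbitrary [a, b] range rather than always starting from 2.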

  • "not sure if size_t is guaranteed not to be narrower than int" --> This is not guaranteed (there is no specification for this). It is simply uncommon. Commented Nov 13, 2017 at 21:31
  • @chux: Fair enough. Added an explicit check for it. Commented Nov 13, 2017 at 23:13
  • When I attempt to print all the primes up to 2,000,000,000 to stdout, I get a seg fault. Is there any way to avoid a core dump, detect the extremely high amount, and terminate the program before the seg fault occurs? Commented Nov 14, 2017 at 1:33
  • @carefulnow: That's strange. A segfault implies there's a bug somewhere, but the code above runs just fine for me with a maximum of 2000000000, and I don't see what could be making it crash. What compiler and OS are you using, and are you compiling for a 32 or a 64 bit target? And can you run the code under a debugger and see which line it segfaults on? Commented Nov 14, 2017 at 2:43
  • With enough optimizations (segmenting, more wheels, prefill, etc.) one gets to the point where the printf loop takes more time than sieving. Of course you don't want to optimize that until needed, but just pointing out it does happen (e.g. printing to 2e9 at over 8x faster than the listed code, using less memory, and working for any 64-bit {a,b} range), and hand-rolling a buffered integer output is 2-3x faster than printf.
    – DanaJ, Nov 25, 2017 at 4:35

This works really well and is very easy to read! Here are some suggestions.

Don't Reinvent the Wheel

There are plenty of ways to convert a string to a number in C. There's no need to write your own. You could instead use atoi(), strtol() or sscanf().
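For example, a strtol()-based parse with error checking might look something like this (just a sketch; parse_max is an illustrative name, not something from the question):

#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a non-negative int from str; returns -1 on any error. */
static int parse_max(const char *str) {
    char *end;
    errno = 0;
    long value = strtol(str, &end, 10);
    if (errno != 0 || end == str || *end != '\0')   /* overflow, no digits, or trailing junk */
        return -1;
    if (value < 0 || value > INT_MAX)
        return -1;
    return (int)value;
}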

Use the Right Comparison

You made max be an unsigned int, but then you compare it against -1 and -2. Those values are not unsigned. I would make some named constants for those comparisons. Something like:

static const unsigned int ERROR_NO_NUMBERS = UINT_MAX;
static const unsigned int ERROR_MAX_BAD_VALUE = ERROR_NO_NUMBERS - 1;

Note that the value for ERROR_NO_NUMBERS depends on the size of an int on your platform. I assumed it was 32 bits.

Prefer const to #define

Preprocessor macros created with #define are problematic. They can lead to very hard-to-understand code with surprising effects. For defining a constant it's not a huge deal, but it's good to get into the habit of using const for constants and inline functions for calculations instead of macros, because of the problems macros have.
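For instance (illustrative names only):

#define SQUARE(x) ((x) * (x))            /* macro: SQUARE(i++) silently increments i twice */

static const int MAX_LIMIT = 10000000;   /* a typed constant the compiler and debugger can see */
static inline int square(int x) { return x * x; }   /* evaluates its argument exactly once */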

Performance

Aside from using a different algorithm, I don't see a lot of ways to improve the speed of this. One minor improvement might come from filling the array with alternating true/false values at the start so you iterate over fewer values. Something like this:

// Skip the first 4 values
list [ 0 ] = true;
list [ 1 ] = true;
list [ 2 ] = true;
list [ 3 ] = true;
// This value represents 4 bools in order false, true, false, true
const bool FALSE_TRUE_FALSE_TRUE[] = {false, true, false, true};
// Set all even values over 3 to false since we know they're divisible by 2
memset_pattern4(&list [ 4 ], FALSE_TRUE_FALSE_TRUE, max - 4);
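Note that memset_pattern4() is, as far as I know, an Apple extension declared in macOS's <string.h> rather than standard C; on other platforms a plain loop gives the same result:

/* portable equivalent: even indices start out false, odd indices true */
for (int i = 4; i < max; i++)
    list[i] = (i % 2 != 0);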

You can then skip checking for 2 and start at 3:

for (int i = 3; i <= max_sqrt; i++) {
    if (list[i])
        for (int j = i*i; j <= max; j += i)
            list[j] = false;
}

Unused Variables

The iteration variable doesn't appear to be used anywhere. It should be removed.

Error Handling

I find your error handling a little odd. (Although it's great that you have it! It's often missed. Nice work.) You check to see if there's no number, and if so, output an error and exit. But if there is a number but it's out of range, you use the maximum allowed value and continue on. I'd make them both do the same thing. Either both exit or both use the maximum and continue on. (Personally, I think stopping is smarter.)

  • Thanks for your advice! However, if preprocessor macros created with #define are seen as problematic, is there ever an appropriate use for them, or are they just a thing of the past that should be avoided? Commented Nov 12, 2017 at 16:49
  • This link explains it pretty well. If you need to use a value as an array size initializer, and are using strict ANSI C, then you can't use a const variable; you must use a #define. Most C compilers have a setting to relax this, though, I'm pretty sure. Commented Nov 12, 2017 at 18:35
  • const unsigned int ERROR_NO_NUMBERS = UINT_MAX; Commented Nov 12, 2017 at 23:31
  • Use static const unsigned int if you want to let the compiler avoid actually storing it in memory if it wants to. Commented Nov 12, 2017 at 23:34
  • Where does that memset_pattern4 come from? I can't find it in the standard library. Commented Nov 13, 2017 at 11:17

Don't define bool yourself

typedef unsigned char bool; is technically legal in C, but very bad style. If you want a proper bool in C that works like it does in C++ (guaranteed to only be 0 or 1), #include <stdbool.h>.

If you want a 0 / non-zero integer type that avoids the possible inefficiencies of bool (Boolean values as 8 bit in compilers. Are operations on them inefficient?), call it something else. bool is a keyword in C++.


Use character constants, not ASCII codes

It's not usually useful to write your own string->int functions in the first place. But there are other times where you want to work with characters.

if (inp[i] < 48 || inp[i] > 57)

This is a very low-readability way to write !isdigit(inp[i]). But maybe the library isdigit() is slow and you don't want it to check whether a locale is set or any such nonsense. Hard-coding ASCII constants is still usually not the right way to go. Use character literals like this to make it more readable:

if (inp[i] < '0' || inp[i] > '9')

As a bonus, now your code is portable to EBCDIC or other non-ASCII C implementations. (Hard-coding 48 and 57 would be useful if you explicitly want to work with ASCII, regardless of the native character set of the C implementation.) BTW, the C standard does guarantee that the integer values of '0' through '9' are contiguous and in the expected order, so '0' + single_digit_number works.

The only even prime is 2

for (int i = 2; i <= max_sqrt; i++) is a good start (only going up to sqrt(max)), but you can skip the even numbers. If you start with i=3 and do i+=2, you avoid wasting time testing list[i] for even numbers. (set them to false in a startup loop instead of memset(true)).

You can compact your list by not even storing the even numbers in it. So list[i] represents the primality of 2*i + 1. This cuts your cache footprint in half. It does make the logic more tricky, though. Let's take 3 as an example. list[i=1] corresponds to the primality of 3, so we need to mark 9 as composite (stored in list[4]). But 12 is even, so it's not in our list. 15 is stored in list[7], and so on. So we still do j += 3, but it really means we're skipping the even multiples. Squaring is also tricky.

memset(list, true, max);
// max_sqrt should actually be (sqrt(max*2+1) - 1) / 2
for (int i = 1; i <= max_sqrt; i++) {
    if (list[i])  {
        int prime = 2*i + 1;
         // j_start = (prime*prime - 1) / 2  but avoiding overflow, or signed division or shift
         //           ((4*i*i + 4*i + 1) - 1) / 2
         //           2*i*i + 2*i
        for (int j = 2*i*i + 2*i; j <= max; j += prime)
            list[j] = false;       // actually striding by prime*2, but with compressed indexing
    }
}

puts("2: prime");
for (int i = 1; i < max; i++)
    if (list[i])
        printf("%d: prime\n", 2*i + 1);

I'd probably use unsigned for many of these.


More complicated indexing makes it possible to bake in other small primes like 3 and 5. Much has been written about prime sieving; it's always possible to make an implementation more complicated to gain some performance.

One major thing for large data sets is using a bitmap (8 true/false elements per byte) instead of only one per unsigned char. This requires a read-modify-write to only clear one bit, so it only pays for itself with a large enough max that you start to get cache misses with the per-byte version.

See Is using a vector of boolean values slower than a dynamic bitset? for some benchmarks on this, specifically in the context of a Sieve. Note that C++'s vector<bool> is a bitmap, but you can implement one yourself in C with shifts. (Making it portable requires using CHAR_BIT if you don't want to assume 8-bit char).
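A minimal pair of helpers along those lines might look like this (a sketch using CHAR_BIT, so it doesn't assume 8-bit bytes; the names are illustrative):

#include <limits.h>   /* CHAR_BIT */

/* test / clear a single bit in a bitmap of unsigned chars */
static inline int bitmap_test(const unsigned char *bm, size_t idx) {
    return (bm[idx / CHAR_BIT] >> (idx % CHAR_BIT)) & 1u;
}

static inline void bitmap_clear(unsigned char *bm, size_t idx) {
    bm[idx / CHAR_BIT] &= (unsigned char)~(1u << (idx % CHAR_BIT));
}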

To optimize the bit-setting for Sieve, you can make a bit-mask to AND or OR and rotate it instead of re-creating it for each word of the bitmap.

Note that you can store your bitmap as 0 means prime or 1 means prime, so you can choose whichever is more efficient: generating a value with all-but-one bit set, or a value with only one bit set.

Anyway, all of this is going well beyond what you were aiming for with your implementation.

If you're curious about performance, profile your code. e.g. on Linux, perf stat -d ./soe to show CPU hardware performance counters for cache misses. Make sure you compiled with gcc -O3 -march=native, or with optimizations enabled for whatever other compiler you use.

  • In fact, if implementing your own bitmap, then you could have CHAR_BIT elements per byte (just like std::vector<bool>), which may be more than 8 on certain platforms... Commented Nov 13, 2017 at 11:24
  • @TobySpeight: A char isn't necessarily a byte. e.g. a 32-bit word-addressable DSP will use 32-bit char (because it's the smallest unit of "memory locations"). Word-addressable machines are just word-addressable; that doesn't mean they have 32-bit bytes. (But C11 rules require that separate threads can modify separate char objects without data races / inventing writes to neighbouring objects, e.g. read-modify-write of a containing word.) Commented Nov 13, 2017 at 16:33
  • @TobySpeight: Typical std::vector<bool> implementations (like libc++ or libstdc++) use unsigned long chunks, IIRC. That's not what you want here: smaller operand size is good (for x86 at least) because for small strides, you want nearby read-modify-write operations to be modifying different bytes instead of read-modify-writing the same 64-bit chunk of memory multiple times (although this only gets bad with a stride of 3, and that happens once). It's not like a histogram or set-membership bitmap where repeated input could make you modify nearby elements repeatedly. Commented Nov 13, 2017 at 16:36
  • True, but a byte isn't necessarily 8 bits (the term "octet" exists for 8-bit chunks). Not that we see many machines with 9-bit bytes these days, and newer standards require at least 256 distinct values of char IIRC. I'd guess that cache line sizes are as significant as the chunk size of std::bitset and std::vector<bool>, particularly if splitting the implementation across threads. None of this is to say your review isn't good - you already have a +1 from me! Commented Nov 13, 2017 at 16:38
  • @TobySpeight: right, using CHAR_BIT makes your code work on HW with 9-bit bytes as well as machines where a char isn't a byte at all. My point is that char isn't necessarily a byte. re: chunk size: your choice of chunk size isn't influenced by cache-line size. But sure, a machine with smaller cache lines might be faster once your strides are big enough; less wasted RMW of untouched data. But some CPUs (e.g. single thread on a many-core Xeon) are limited by the maximum concurrency (of outstanding cache misses) more than by raw DRAM bandwidth. stackoverflow.com/a/43574756/224132 Commented Nov 13, 2017 at 17:08

int overflow - which is undefined behavior

See below:
Assume argv[1] is the text version of INT_MAX. int j = i*i; will cause j to take on values increasingly close to INT_MAX, and then j += i can overflow the int, which is UB.

j <= max, which is j <= (unsigned)INT_MAX in this case, is always true as long as j is positive. j only becomes negative due to the UB of the int overflow in j += i.

The code may appear to perform correctly for the OP, because the typical result of the UB here is to wrap the sum to a very negative number, and the later j <= max converts the negative j to unsigned, giving the hoped-for comparison result. Yet by then it is too late: UB has already occurred.

unsigned int max = to_int(argv[1]);
...
max++;
...
int max_sqrt = sqrt(max);
for (int i = 2; i <= max_sqrt; i++) {
    if (list[i])
        for (int j = i*i; j <= max; j += i)  // j += i could overflow
            list[j] = false;
}

Instead, avoid the overflow by starting from the top and working down. The j -= i will not "underflow" the integer, whether i and j are int or unsigned.

for (int i = 2; i <= max_sqrt; i++) {
    if (list[i]) {
        int jmax = max/i*i;
        int jmin = i*i;
        for (int j = jmax; j >= jmin; j -= i)
            list[j] = false;
    }
}

Candidate performance gain

Rather than count down to jmin as proposed above, offset the index to 0 to simplify the end-of-loop test to a compare against 0. Of course, profiling is needed to see if this improves the code for a given compiler - yet it is a good candidate for improving the innermost loop of the sieve.

        int jmax = max/i*i;
        int jmin = i*i;  
        bool *base = &list[jmin];
        int j = jmax - jmin;
        base[j] = false;
        while (j) {
          j -= i;
          base[j] = false;
        }
  1. You could maybe gain something from tighter packing. Pack 8 logical values into each byte and do bit manipulation to set and read the individual 1s and 0s (true or false). This would let you fit an 8-times-larger table in the same amount of memory, at the cost of some (very computationally cheap) arithmetic and bit fiddling for each lookup.
  2. Another thing you could do is make the application multi-threaded. Split the responsibility of clearing multiples of primes among your CPU's cores. If you plan the memory accesses right, each core can work mostly within its own cache; see the sketch after this list.
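Regarding point 2, here is a rough sketch of the "each thread owns its own chunk of the array" approach using OpenMP (illustrative only: the function and variable names are mine, it assumes the base primes up to sqrt(max) have already been found by an ordinary serial sieve and passed in, and it must be compiled with -fopenmp; without that flag the pragma is ignored and it simply runs serially):

#include <stdbool.h>

/* Mark composites in list[2..max]. Each iteration of the outer loop
 * writes only within its own [low, high] window, so no two threads
 * ever store to the same bytes. */
void mark_composites_parallel(bool *list, long max,
                              const long *base_primes, int n_base) {
    const long chunk = 1L << 20;   /* arbitrary window size; tune to cache */
    #pragma omp parallel for schedule(dynamic)
    for (long low = 2; low <= max; low += chunk) {
        long high = low + chunk - 1;
        if (high > max)
            high = max;
        for (int k = 0; k < n_base; k++) {
            long p = base_primes[k];
            long start = (low + p - 1) / p * p;   /* first multiple of p in the window */
            if (start < p * p)
                start = p * p;                    /* smaller multiples belong to smaller primes */
            for (long j = start; j <= high; j += p)
                list[j] = false;
        }
    }
}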
  • Safely and efficiently manipulating shared data from multiple threads is tricky, to say the least. For example, it's quite possible for one thread to fail to see changes made by another due to caching, or even to accidentally erase changes made by other threads while flushing its cache to shared RAM. There are ways to avoid such problems, but trying to do it without introducing inefficient locks is not trivial. Commented Nov 12, 2017 at 23:48
  • @IlmariKaronen: yes, if they need to communicate with one another and/or share memory, but in this particular application I think we can avoid the risk of not seeing the changes. For example, we can split the memory into equal-sized chunks where each thread is responsible for only its own local area. Commented Nov 13, 2017 at 5:49
  • @IlmariKaronen: I think you'd lose more than you gain from having every thread update a shared list[] with atomic bit-clearing. Besides making every thread run slower (because atomic ops are slower), they would compete with each other for access to the same cache lines. Or if you don't use a bitmap, then each thread can still just do simple relaxed or release byte-stores (_Atomic uint8_t list[]), but the sharing is still a big problem. Commented Nov 13, 2017 at 16:52
  • Maybe a segmented sieve could thread better, splitting things up by the area you write in rather than which primes you write. That way yes, you could maybe have significant cache hits if you have a lot of cores with enough L2 to keep much of the list in memory. (e.g. Skylake Xeon with 1MB per core L2 and core counts of up to 28 cores per socket.) Commented Nov 13, 2017 at 16:55
  • @PeterCordes, an alternative approach to segmentation is to use a different sieve. Sundaram's sieve is simple and amenable to this kind of segmentation; Atkin-Bernstein is less simple, but very segmentable. Commented Nov 14, 2017 at 10:00
