
Fizz Buzz is a common challenge given during interviews. The challenge goes something like this:

Write a program that prints the numbers from 1 to n. If a number is divisible by 3, write Fizz instead. If a number is divisible by 5, write Buzz instead. However, if the number is divisible by both 3 and 5, write FizzBuzz instead.

The goal of this question is to write a FizzBuzz implementation that goes from 1 to infinity (or some arbitrary very very large number), and that implementation should do it as fast as possible.

Checking throughput

Write your fizz buzz program. Run it, piping its output through pv: <your_program> | pv > /dev/null. The higher the throughput, the better you did.

Example

A naive implementation written in C gets you about 170 MiB/s on an average machine:

#include <stdio.h>

int main() {
    for (int i = 1; i < 1000000000; i++) {
        if ((i % 3 == 0) && (i % 5 == 0)) {
            printf("FizzBuzz\n");
        } else if (i % 3 == 0) {
            printf("Fizz\n");
        } else if (i % 5 == 0) {
            printf("Buzz\n");
        } else {
            printf("%d\n", i);
        }
    }
}

There is a lot of room for improvement here. In fact, I've seen an implementation that can get more than 3 GiB/s on the same machine.

I want to see what clever ideas the community can come up with to push the throughput to its limits.

Rules

  • All languages are allowed.
  • The program output must be exactly valid fizzbuzz. No playing tricks such as writing null bytes in between the valid output - null bytes that don't show up in the console but do count towards pv throughput.

Here's an example of valid output:

1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz

# ... and so on

Valid output must be plain ASCII, one byte per character, with newlines as a single \n (not \r\n). The numbers and strings must be correct per the FizzBuzz requirements. The output must go on forever (or up to a very high astronomical number, at least 2^58) and not halt or change prematurely.

  • Parallel implementations are allowed (and encouraged).

  • Architecture specific optimizations / assembly is also allowed. This is not a real contest - I just want to see how people push fizz buzz to its limit - even if it only works in special circumstances/platforms.

Scores

Scores are from running on my desktop with an AMD Ryzen 9 5950X CPU (16C/32T) and 64GB of 3200 MHz RAM. CPU mitigations are disabled.

By far the best score so far is in C++, by @David Frank: generating FizzBuzz at around 1.7 Terabit/s.

In second place, @ais523: generating FizzBuzz at 61 GiB/s with x86 assembly.

Results for java:

Author       Throughput
ioan2        2.6 GiB/s
randy        0.6 GiB/s
randy_ioan   3.3 GiB/s
ioan         4.6 GiB/s
olivier      5.8 GiB/s

Results for python:

Author            Throughput
bconstanzo        0.1 GiB/s
michal            0.1 GiB/s
ksousa_chunking   0.1 GiB/s
ksousa_initial    0.0 GiB/s
arrmansa          0.2 GiB/s
antoine           0.5 GiB/s

Results for pypy:

Author            Throughput
bconstanzo_pypy   0.3 GiB/s

Results for rust:

Author   Throughput
aiden    3.4 GiB/s
xavier   0.9 GiB/s

Results for ruby:

Author      Throughput
lonelyelk   0.1 GiB/s
dgutov      1.7 GiB/s

Results for asm:

Author   Throughput
ais523   60.8 GiB/s
paolo    39.0 GiB/s

Results for julia:

Author   Throughput
marc     0.7 GiB/s
tkluck   15.5 GiB/s

Results for c:

Author      Throughput
random832   0.5 GiB/s
neil        8.4 GiB/s
kamila      8.4 GiB/s
xiver77     20.9 GiB/s
isaacg      5.7 GiB/s

Results for cpp:

Author    Throughput
jdt       4.8 GiB/s
tobycpp   5.4 GiB/s
david     208.3 GiB/s

Results for numba:

Author     Throughput
arrmansa   0.1 GiB/s

Results for numpy:

Author                       Throughput
arrmansa                     0.3 GiB/s
arrmansa-multiprocessing     0.7 GiB/s
arrmansa-multiprocessing-2   0.7 GiB/s

Results for go:

Author    Throughput
Bysmyyr   3.7 GiB/s
psaikko   6.8 GiB/s

Results for php:

Author       Throughput
no_gravity   0.5 GiB/s

Results for elixir:

Author      Throughput
technusm1   0.3 GiB/s

Results for csharp:

Author        Throughput
neon-sunset   1.2 GiB/s


Plots generated using https://github.com/omertuc/fizzgolf

  • How are we scoring speed? If you change the machine, the speed will change drastically. You mentioned architecture-specific optimizations as well, so it seems it is tested on multiple different architectures. How do you ensure that the comparison is fair? – Wheat Wizard, Nov 14, 2020 at 18:07
  • That's definitely a problem. I will try to run the implementations posted here on my desktop and keep score in the main question by editing it when new answers are posted. If it's something architecture-specific I can't run, it'll still be interesting to see; it doesn't have to compete in the same score table. – Commented Nov 14, 2020 at 18:09
  • @agtoever note that the benchmark is done by piping to pv, which means that the Linux default pipe buffer size of 64K is used, as well as pv's default 64K buffer size. My entry exploits these properties by setting the buffer size to 64K, but it still blocks on write due to many factors, one of which is the generation speed being affected by the varying length of the printed numbers. – Isaac G., Nov 15, 2020 at 10:29
  • Related: my answer on an SO question about x86-64 asm fizz buzz includes a tutorial on tuning some parts of it for performance (e.g. unroll by 3 and use a down-counter starting at 5 instead of modulo on i). But it still uses a slow div for int->string, and critically uses write for each group of 15 output lines built up in a buffer. Full-on perf tuning would probably just unroll by 15, and of course use much bigger buffers. – Commented Nov 16, 2020 at 18:25
  • How do you measure the throughput of a program which runs "forever"? If I have a program which first does calculations for a year, then outputs a googol lines in a minute, it will lose if you cut off the measurement within a year, but it will be a winner if you measure for a year and a minute. This is of course an extreme example, but many programs will get slower as the numbers get larger, so how long you measure matters. – Abigail, Nov 21, 2020 at 11:30

42 Answers


x86-64+AVX2 assembly language (Linux, gcc+gas)

Build and usage instructions

This program is most conveniently built using gcc. Save it as fizzbuzz.S (that's a capital S as the extension), and build using the commands

gcc -mavx2 -c fizzbuzz.S
ld -o fizzbuzz fizzbuzz.o

Run as ./fizzbuzz piped into one command, e.g. ./fizzbuzz | pv > /dev/null (as suggested in the question), ./fizzbuzz | cat, or ./fizzbuzz | less. To simplify the I/O, this will not work (producing an error on startup) if you try to output to a file/terminal/device rather than a pipe. Additionally, this program may produce incorrect output if piped into two commands (e.g. ./fizzbuzz | pv | cat > fizzbuzz.txt), but only in the case where the middle command uses the splice system call; this is either a bug in Linux (very possible with system calls this obscure!) or a mistake in the documentation of the system calls in question (also possible). However, it should work correctly for the use case in the question, which is all that matters on CGCC.

This program is somewhat system-specific; it requires the operating system to be a non-ancient version of Linux, and the processor to be an x86-64 implementation that supports AVX2. (Most moderately recent processors by Intel and AMD have AVX2 support, including the Ryzen 9 mentioned in the question, and almost all use the x86-64 instruction set.) However, it avoids assumptions about the system it's running on beyond those mentioned in the header, so there's a decent chance that if you can run Linux, you can run this.

The program outputs a quintillion lines of FizzBuzz and then exits (going further runs into problems related to the sizes of registers). This would take tens of years to accomplish, so hopefully counts as "a very high astronomical number" (although it astonishes me that it's a small enough timespan that it might be theoretically possible to reach a number as large as a quintillion without the computer breaking).

As a note: this program's performance is dependent on whether it and the program it outputs to are running on sibling CPUs or not, something which will be determined arbitrarily by the kernel when you start it. If you want to compare the two possible timings, use taskset to force the programs onto particular CPUs: taskset 1 ./fizzbuzz | taskset 2 pv > /dev/null versus taskset 1 ./fizzbuzz | taskset 4 pv > /dev/null. (The former will probably run faster, but might be slower on some CPU configurations.)

Discussion

I've spent months working on this program. I've long thought that "how fast can you make a FizzBuzz" would be a really interesting question for learning about high-performance programming, and when I subsequently saw this question posted on CGCC, I pretty much had to try.

This program aims for the maximum possible single-threaded performance. In terms of the FizzBuzz calculation itself, it is intended to sustain a performance of 64 bytes of FizzBuzz per 4 clock cycles (and is future-proofed where possible to be able to run faster if the relevant processor bottleneck – L2 cache write speed – is ever removed). This is faster than a number of standard functions. In particular, it's faster than memcpy, which presents interesting challenges when it comes to I/O (if you try to output using write then the copies in write will take up almost all the runtime – replacing the I/O routine here with write causes the performance on my CPU to drop by a factor of 5). As such, I needed to use much more obscure system calls to keep I/O-related copies to a minimum (in particular, the generated FizzBuzz text is only sent to main memory if absolutely necessary; most of the time it's stored in the processor's L2 cache and piped into the target program from there, which is why reading it from a sibling CPU can boost performance – the physical connection to the L2 cache is shorter and higher bandwidth than it would be to a more distant CPU).
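The vmsplice mechanism described above can be sketched in C (this is my own simplified illustration of the system call, not the answer's actual I/O routine, which is written in assembly below):

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice() */
#include <sys/uio.h>    /* struct iovec */
#include <sys/types.h>  /* ssize_t */
#include <stddef.h>

/* Hand the kernel a *reference* to `buf` instead of copying it into the
 * pipe `fd`. Like most output calls, vmsplice can do a short write, so
 * loop until the whole buffer has been spliced. The caller must not
 * modify `buf` until the reader is guaranteed to have consumed it,
 * which is why the program resizes the pipe and double-buffers. */
static int splice_out(int fd, char *buf, size_t len) {
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(fd, &iov, 1, 0);
        if (n < 0)
            return -1;
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len  -= n;
    }
    return 0;
}
```

In the real program, the output alternates between two 2 MiB buffers, and the pipe is sized so that successfully splicing one buffer proves the reader has finished with the other; that is what makes it safe to overwrite a buffer without any copy ever being made.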

On my computer (which has a fairly recent, but not particularly powerful, Intel processor), this program generates around 31GiB of FizzBuzz per second. I'll be interested to see how it does on the OP's computer.

I did experiment with multithreaded versions of the program, but was unable to gain any speed. Experiments with simpler programs show that it could be possible, but any gains may be small; the cost of communication between CPUs is sufficiently high to negate most of the gains you could get by doing work in parallel, assuming that you only have one program reading the resulting FizzBuzz (and anything that writes to memory will be limited by the write speed of main memory, which is slower than the speed with which the FizzBuzz can be generated).

The program

This isn't code-golf, so my explanation of the program and its algorithm is given as comments in the program itself. (I still had to lightly golf the program, and especially the explanation, to fit this post within the 65536-byte size limit.)

The program is written in a "literate" assembly style; it will be easiest to understand if you read it in order, from start to end. (I also added a number of otherwise useless line labels to separate the program into logical groups of instructions, in order to make the disassembly easier to read, if you're one of the people who prefers to read assembly code like that.)

.intel_syntax prefix

// Header files.
#include <asm/errno.h>
#include <asm/mman.h>
#include <asm/unistd.h>
#define F_SETPIPE_SZ 1031 // not in asm headers, define it manually

// The Linux system call API (limited to 4 arguments, the most this
// program uses). 64-bit registers are unsuffixed; 32-bit have an "e"
// suffix.
#define ARG1 %rdi
#define ARG1e %edi
#define ARG2 %rsi
#define ARG2e %esi
#define ARG3 %rdx
#define ARG3e %edx
#define ARG4 %r10
#define ARG4e %r10d
#define SYSCALL_RETURN %rax
#define SYSCALL_RETURNe %eax
#define SYSCALL_NUMBER %eax

// %rax, %rcx, %rdx, %ymm0-3 are general-purpose temporaries. Every
// other register is used for just one or two defined purposes; define
// symbolic names for them for readability. (Bear in mind that some of
// these will be clobbered sometimes, e.g. OUTPUT_LIMIT is clobbered
// by `syscall` because it's %r11.)
#define OUTPUT_PTR %rbx
#define BYTECODE_IP %rbp
#define SPILL %rsi
#define BYTECODE_GEN_PTR %rdi
#define REGEN_TRIGGER %r8
#define REGEN_TRIGGERe %r8d
#define YMMS_AT_WIDTH %r9
#define YMMS_AT_WIDTHe %r9d
#define BUZZ %r10
#define BYTECODE_NEG_LEN %r10
#define FIZZ %r11
#define FIZZe %r11d
#define OUTPUT_LIMIT %r11
#define BYTECODE_END %r12
#define BYTECODE_START %r13
#define BYTECODE_STARTe %r13d
#define PIPE_SIZE %r13
#define LINENO_WIDTH %r14
#define LINENO_WIDTHe %r14d
#define GROUPS_OF_15 %r15
#define GROUPS_OF_15e %r15d
#define LINENO_LOW %ymm4
#define LINENO_MID %ymm5
#define LINENO_MIDx %xmm5
#define LINENO_TOP %ymm6
#define LINENO_TOPx %xmm6
#define LINENO_MID_TEMP %ymm7
#define ENDIAN_SHUFFLE %ymm8
#define ENDIAN_SHUFFLEx %xmm8
#define LINENO_LOW_INCR %ymm9
#define LINENO_LOW_INCRx %xmm9

// The last six vector registers are used to store constants, to avoid
// polluting the cache by loading their values from memory.
#define LINENO_LOW_INIT %ymm10
#define LINENO_MID_BASE %ymm11
#define LINENO_TOP_MAX %ymm12
#define ASCII_OFFSET %ymm13
#define ASCII_OFFSETx %xmm13
#define BIASCII_OFFSET %ymm14
#define BASCII_OFFSET %ymm15


// Global variables.
.bss
.align 4 << 20
// The most important global variables are the IO buffers. There are
// two of these, each with 2MiB of memory allocated (not all of it is
// used, but putting them 2MiB apart allows us to simplify the page
// table; this gives a 30% speedup because page table contention is
// one of the main limiting factors on the performance).
io_buffers:
.zero 2 * (2 << 20)
// The remaining 2MiB of memory stores everything else:
iovec_base:          // I/O config buffer for vmsplice(2) system call
.zero 16
error_write_buffer:  // I/O data buffer for write(2) system call
.zero 1
.p2align 9,0
bytecode_storage:    // the rest is a buffer for storing bytecode
.zero (2 << 20) - 512


// The program starts here. It doesn't use the standard library (or
// indeed any libraries), so the start point is _start, not main.
.text
.globl _start
_start:

// This is an AVX2 program, so check for AVX2 support by running an
// AVX2 command. This is a no-op, but generates SIGILL if AVX2 isn't
// supported.
vpand %ymm0, %ymm0, %ymm0

// Initialize constant registers to their constant values.
vmovdqa LINENO_LOW_INIT, [%rip + lineno_low_init]
vmovdqa LINENO_MID_BASE, [%rip + lineno_mid_base]
vmovdqa LINENO_TOP_MAX, [%rip + lineno_top_max]
vmovdqa ASCII_OFFSET, [%rip + ascii_offset]
vmovdqa BIASCII_OFFSET, [%rip + biascii_offset]
vmovdqa BASCII_OFFSET, [%rip + bascii_offset]

// Initialize global variables to their initial values.
vmovdqa ENDIAN_SHUFFLE, [%rip + endian_shuffle_init]
vmovdqa LINENO_TOP, [%rip + lineno_top_init]

// Check the size of the L2 cache.
//
// This uses the CPUID interface. To use it safely, check what range
// of command numbers is legal; commands above the legal range have
// undefined behaviour, commands within the range might not be
// implemented but will return all-zeros rather than undefined values.
// CPUID clobbers a lot of registers, including some that are normally
// call-preserved, so this must be done first.
mov %eax, 0x80000000 // asks which CPUID extended commands exist
cpuid                // returns the highest supported command in %eax
cmp %eax, 0x80000006 // does 0x80000006 give defined results?
jb bad_cpuid_error

mov %eax, 0x80000006 // asks about the L2 cache size
cpuid                // returns size in KiB in the top half of %ecx
shr %ecx, 16
jz bad_cpuid_error   // unsupported commands return all-0s

// Calculate the desired pipe size, half the size of the L2 cache.
// This value is chosen so that the processor can hold a pipeful of
// data being output, plus a pipeful of data being calculated, without
// needing to resort to slow L3 memory operations.
shl %ecx, 10 - 1     // convert KiB to bytes, then halve
mov PIPE_SIZE, %rcx

// Ask the kernel to resize the pipe on standard output.
mov ARG1e, 1
mov ARG2e, F_SETPIPE_SZ
mov ARG3e, %ecx
mov SYSCALL_NUMBER, __NR_fcntl
syscall
cmp SYSCALL_RETURNe, -EBADF
je pipe_error
cmp SYSCALL_RETURNe, -EPERM
je pipe_perm_error
call exit_on_error
cmp SYSCALL_RETURN, PIPE_SIZE
jne pipe_size_mismatch_error

// Ask the kernel to defragment the physical memory backing the BSS
// (read-write data) segment. This simplifies the calculations needed
// to find physical memory addresses, something that both the kernel
// and processor would otherwise spend a lot of time doing, and
// speeding the program up by 30%.
lea ARG1, [%rip + io_buffers]
mov ARG2e, 3 * (2 << 20)
mov ARG3e, MADV_HUGEPAGE
mov SYSCALL_NUMBER, __NR_madvise
syscall
call exit_on_error

// From now on, OUTPUT_PTR is permanently set to the memory location
// where the output is being written. This starts at the start of the
// first I/O buffer.
lea OUTPUT_PTR, [%rip + io_buffers]


///// First phase of output
//
// The FizzBuzz output is produced in three distinct phases. The first
// phase is trivial; just a hardcoded string, that's left in the
// output buffer, to be output at the end of the second phase.

first_phase:

.section .rodata
fizzbuzz_intro:
.ascii "1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\n"
.text
vmovdqu %ymm0, [%rip + fizzbuzz_intro]
vmovdqu [OUTPUT_PTR], %ymm0
add OUTPUT_PTR, 30


///// Second phase of output
//
// This is a routine implementing FizzBuzz in x86-64+AVX2 assembler in
// a fairly straightforward and efficient way. This isn't as fast as
// the third-phase algorithm, and can't handle large numbers, but will
// introduce some of the basic techniques this program uses.

second_phase_init:

// The outer loop of the whole program breaks the FizzBuzz output into
// sections where all the line numbers contain the same number of
// digits. From now on, LINENO_WIDTH tracks the number of digits in
// the line number. This is currently 2; it ranges from 2-digit
// numbers to 18-digit numbers, and then the program ends.
mov LINENO_WIDTHe, 2

// GROUPS_OF_15 is permanently set to the number of groups of 15 lines
// that exist at this line number width; it's multiplied by 10 whenever
// LINENO_WIDTH is incremented.
//
// A general note about style: often the program uses numbers that are
// statically known to fit into 32 bits, even in a register that's
// conceptually 64 bits wide (like this one). In such cases, the
// 32-bit and 64-bit versions of a command will be equivalent (as the
// 32-bit version zero-extends to 64-bits on a 64-bit processor); this
// program generally uses the 32-bit version, both because it
// sometimes encodes to fewer bytes (saving cache pressure), and
// because some processors recognise zeroing idioms only if they're 32
// bits wide.
mov GROUPS_OF_15e, 6

// Some constants used throughout the second phase, which permanently
// stay in their registers. Note that short string literals can be
// stored in normal integer registers - the processor doesn't care.
mov FIZZ, 0x0a7a7a6946  // "Fizz\n"
mov BUZZ, 0x0a7a7a7542  // "Buzz\n"

.section .rodata
.p2align 5, 0
second_phase_constants:
.byte 0, 0, 0, 0, 0, 0, 0, 0
.byte 1, 0, 0, 0, 0, 0, 0, 0
.text
vmovdqa %xmm3, [%rip + second_phase_constants]

// This program makes extensive use of a number format that I call
// "high-decimal". This is a version of decimal where the digit 0 is
// encoded as the byte 246, the digit 1 as the byte 247, ..., the
// digit 9 as the byte 255. The bytes are stored in the normal
// endianness for the processor (i.e. least significant first), and
// padded to a known length (typically 8 digits) with leading zeroes.
//
// The point of high-decimal is that it allows us to use arithmetic
// operators intended for binary on high-decimal numbers, and the
// carries will work the same way (i.e. the same digits will carry,
// although carries will be 0-based rather than 246-based); all that's
// required is to identify the digits that carried and add 246 to
// them. That means that the processor's binary ALU can be used to do
// additions directly in decimal - there's no need for loops or
// anything like that, and no need to do binary/decimal conversions.
//
// The first use for high-decimal is to store the line number during
// the second phase (it's stored differently in the third phase).
// It's stored it in the top half of %xmm1 (although it's only 64 bits
// wide, it needs to be in a vector register so that it can be
// interpreted as 8 x 8 bits when necessary; general-purpose
// registers can't do that). The bottom half of %xmm1 is unused, and
// frequently overwritten with arbitrary data.
.section .rodata
line_number_init:
#define REP8(x) x,x,x,x,x,x,x,x
.byte REP8(0)
.byte 246, 247, 246, 246, 246, 246, 246, 246
.text
vmovdqa %xmm1, [%rip + line_number_init]

// Writing line numbers is nontrivial because x86-64 is little-endian
// but FizzBuzz output is big-endian; also, leading zeroes aren't
// allowed. ENDIAN_SHUFFLE is used to fix both these problems; when
// used to control the vector shuffler, it reverses the order of a
// vector register, and rotates the elements to put the first digit
// (based on LINENO_WIDTH) into the first byte. (This method is used
// by both the second and third phases; the second phase uses only the
// bottom half, with the top half used by the third phase, but they
// are both initialized together.)
.section .rodata
endian_shuffle_init:
.byte 9, 8, 7, 6, 5, 4, 3, 2
.byte 1, 0, 255, 254, 253, 252, 251, 250
.byte 3, 2, 1, 0, 255, 254, 253, 252
.byte 251, 250, 249, 248, 247, 246, 245, 244
.text


second_phase_per_width_init:

// The second phase writing routines are macros.
//
// Fizz and Buzz are trivial. (This writes a little beyond the end of
// the string, but that's OK; the next line will overwrite them.)
#define WRITE_FIZZ   mov [OUTPUT_PTR], FIZZ; add OUTPUT_PTR, 5
#define WRITE_BUZZ   mov [OUTPUT_PTR], BUZZ; add OUTPUT_PTR, 5

// For FizzBuzz, output 32 bits of FIZZ to write "Fizz" with no
// newline, then write a "Buzz" after that.
#define WRITE_FIZZBUZZ \
  mov [OUTPUT_PTR], FIZZe; mov [OUTPUT_PTR + 4], BUZZ; \
  add OUTPUT_PTR, 9

// To write a line number, add 58 to each byte of the line number
// %xmm1, fix the endianness and width with a shuffle, and write a
// final newline.
.section .rodata
ascii_offset:
.byte REP8(58), REP8(58), REP8(58), REP8(58)
.text
#define WRITE_LINENO \
  vpaddb %xmm0, ASCII_OFFSETx, %xmm1; \
  vpshufb %xmm0, %xmm0, ENDIAN_SHUFFLEx; \
  vmovdqu [OUTPUT_PTR], %xmm0; \
  lea OUTPUT_PTR, [OUTPUT_PTR + LINENO_WIDTH + 1]; \
  mov byte ptr [OUTPUT_PTR - 1], 10  // 10 = newline

// Incrementing the line number is fairly easy: add 1 (in the usual
// binary notation, taken from %xmm3) to the high-decimal number, then
// convert any bytes that produced a carry to high-decimal 0s by
// max-ing with 246.
//
// Normally I'd use a separate constant for this, but there randomly
// happens to be an %xmm register with 246s in its top half already
// (it's intended for an entirely different purpose, but it'll do for
// this one too).
#define INC_LINENO \
  vpaddq %xmm1, %xmm3, %xmm1; vpmaxub %xmm1, LINENO_TOPx, %xmm1

// Avoid modulus tests by unrolling the FizzBuzz by 15. (Bear in mind
// that this starts at 10, not 0, so the pattern will have a different
// phase than usual.)
mov %ecx, GROUPS_OF_15e
fifteen_second_phase_fizzbuzz_lines:
WRITE_BUZZ; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_FIZZ; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_FIZZBUZZ; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_FIZZ; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_BUZZ; INC_LINENO
WRITE_FIZZ; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_LINENO; INC_LINENO
WRITE_FIZZ; INC_LINENO
dec %ecx
jnz fifteen_second_phase_fizzbuzz_lines

second_phase_increment_width:

lea GROUPS_OF_15e, [GROUPS_OF_15 + GROUPS_OF_15 * 4]
add GROUPS_OF_15e, GROUPS_OF_15e
inc LINENO_WIDTHe

// Increment every element of the low half of ENDIAN_SHUFFLE to
// adjust it for the new width, while leaving the top half unchanged.
vpcmpeqb %xmm0, %xmm0, %xmm0
vpsubb ENDIAN_SHUFFLE, ENDIAN_SHUFFLE, %ymm0

// The second phase handles line numbers with 2 to 5 digits.
cmp LINENO_WIDTHe, 6
jne second_phase_per_width_init

///// The output routine
//
// Most FizzBuzz routines produce output with `write` or a similar
// system call, but these have the disadvantage that they need to copy
// the data being output from userspace into kernelspace. It turns out
// that when running full speed (as seen in the third phase), FizzBuzz
// actually runs faster than `memcpy` does, so `write` and friends are
// unusable when aiming for performance - this program runs five times
// faster than an equivalent that uses `write`-like system calls.
//
// To produce output without losing speed, the program therefore needs
// to avoid copies, or at least do them in parallel with calculating
// the next block of output. This can be accomplished with the
// `vmsplice` system call, which tells the kernel to place a reference
// to a buffer into a pipe (as opposed to copying the data into the
// pipe); the program at the other end of this pipe will then be able
// to read the output directly out of this program's memory, with no
// need to copy the data into kernelspace and then back into
// userspace. In fact, it will be reading out of this program's
// processor's L2 cache, without main memory being touched at all;
// this is the secret to high-performance programming, because the
// cache is much faster than main memory is.
//
// Of course, it's therefore important to avoid changing the output
// buffer until the program connected to standard output has actually
// read it all. This is why the pipe size needed to be set earlier; as
// long as the amount of output is always at least as large as the
// pipe size, successfully outputting one buffer will ensure that none
// of the other buffer is left in the pipe, and thus it's safe to
// overwrite the memory that was previously output. There is some need
// to jump through hoops later on to make sure that `swap_buffers` is
// never called with less than one pipeful of data, but it's worth it
// to get the huge performance boost.

mov %rdx, OUTPUT_PTR
and %edx, (2 << 20) - 1

call swap_buffers
jmp third_phase_init

// Takes the amount of data to output in %rdx, and outputs from the
// buffer containing OUTPUT_PTR.
swap_buffers:
and OUTPUT_PTR, -(2 << 20)  // rewind to the start of the buffer
mov [%rip + iovec_base], OUTPUT_PTR
mov [%rip + iovec_base + 8], %rdx
mov ARG1e, 1
lea ARG2, [%rip + iovec_base]
mov ARG3e, 1
xor ARG4e, ARG4e

// As with most output commands, vmsplice can do a short write
// sometimes, so it needs to be called in a loop in order to ensure
// that all the output is actually sent.
1: mov SYSCALL_NUMBER, __NR_vmsplice
syscall
call exit_on_error
add [ARG2], SYSCALL_RETURN
sub [ARG2 + 8], SYSCALL_RETURN
jnz 1b

xor OUTPUT_PTR, (2 << 20)  // swap to the other buffer
ret


///// Third phase of output
//
// This is the heart of this program. It aims to be able to produce a
// sustained output rate of 64 bytes of FizzBuzz per four clock cycles
// in its main loop (with frequent breaks to do I/O, and rare breaks
// to do more expensive calculations).
//
// The third phase operates primarily using a bytecode interpreter; it
// generates a program in "FizzBuzz bytecode", for which each byte of
// bytecode generates one byte of output. The bytecode language is
// designed so that it can be interpreted using SIMD instructions; 32
// bytes of bytecode can be loaded from memory, interpreted, and have
// its output stored back into memory using just four machine
// instructions. This makes it possible to speed up the FizzBuzz
// calculations by hardcoding some of the calculations into the
// bytecode (this is similar to how JIT compilers can create a version
// of the program with some variables hardcoded, and throw it away on
// the rare occasions that those variables' values change).

third_phase_init:

// Reinitialize ENDIAN_SHUFFLE by copying the initializer stored in
// its high half to both halves. This works in the same way as in the
// second phase.
vpermq ENDIAN_SHUFFLE, ENDIAN_SHUFFLE, 0xEE

// Up to this point, PIPE_SIZE has held the size of the pipe. In order
// to save on registers, the pipe size is from now on encoded via the
// location in which the bytecode program is stored; the bytecode is
// started at iovec_base + PIPE_SIZE (which will be somewhere within
// bytecode_storage), so the same register can be used to find the
// bytecode and to remember the pipe size.
lea %rax, [%rip + iovec_base]
add BYTECODE_START, %rax  // BYTECODE_START is a synonym for PIPE_SIZE

// The bytecode program always holds instructions to produce exactly
// 600 lines of FizzBuzz. At width 6, those come to 3800 bytes long.
lea BYTECODE_END, [BYTECODE_START + 3800]

mov REGEN_TRIGGER, -1  // irrelevant until much later, explained there


third_phase_per_width_init:

// Calculate the amount of output at this LINENO_WIDTH. The result
// will always be divisible by 32, and thus is stored as the number of
// 32-byte units at this width; storing it in bytes would be more
// convenient, but sadly would overflow a 64-bit integer towards the
// end of the program.
lea %ecx, [LINENO_WIDTH * 8 + 47]   // bytes per 15 lines
mov YMMS_AT_WIDTH, GROUPS_OF_15
shr YMMS_AT_WIDTH, 5   // to avoid overflow, divide by 32 first
imul YMMS_AT_WIDTH, %rcx

// This program aims to output 64 bytes of output per four clock
// cycles, which it achieves via a continuous stream of 32-byte writes
// calculated by the bytecode program. One major complication here is
// that the 32-byte writes won't correspond to lines of FizzBuzz; a
// single processor instruction may end up outputting multiple
// different line numbers. So it's no longer possible to have a simple
// line number register, like it was in the second phase.
//
// Instead, the program stores an *approximation* of the line number,
// which is never allowed to differ by 100 or more from the "actual"
// line number; the bytecode program is responsible for fixing up the
// approximation to work out the correct line number to output (this
// allows the same CPU instruction to output digits from multiple
// different line numbers, because the bytecode is being interpreted
// in a SIMD way and thus different parts of the bytecode can fix the
// line number up differently within a single instruction).
//
// The line number is split over three processor registers:
// - LINENO_LOW: stores the line number modulo 200
// - LINENO_MID: stores the hundreds to billions digits
// - LINENO_TOP: stores the ten-billions and more significant digits
// (The parity of the 100s digit is duplicated between LINENO_MID and
// LINENO_LOW; this allows a faster algorithm for LINENO_MID updates.)
//
// Because there's only a need to be within 100 of the real line
// number, the algorithm for updating the line numbers doesn't need to
// run all that often (saving processor cycles); it runs once every
// 512 bytes of output, by simply adding a precalculated value
// (LINENO_LOW_INCR) to LINENO_LOW, then processing the carry to
// LINENO_MID (see later for LINENO_TOP). The amount by which the line
// number increases per 512 bytes of output is not normally going to
// be an integer; LINENO_LOW is therefore stored as a 64-bit fixpoint
// number (in which 2**64 represents "200", e.g. 2**63 would be the
// representation of "the line number is 100 mod 200"), in order to
// delay the accumulation of rounding errors as long as possible. It's
// being stored in a vector register, so there are four copies of its
// value; two of them have 50 (i.e. 2**62) added, and two of them have
// 50 subtracted, in order to allow for more efficient code to handle
// the carry to LINENO_MID. Additionally, LINENO_LOW is interpreted as
// a signed number (an older version of this program was better at
// checking for signed than unsigned overflow and I had no reason to
// change).
//
// LINENO_LOW and LINENO_MID are reset at every LINENO_WIDTH increase
// (this is because the program can calculate "past" the width
// increase, as it can't break out of the main loop at every
// instruction, which may cause unwanted carries into LINENO_MID and
// force a reset).

.section .rodata
lineno_low_init:
.byte 0, 0, 0, 0, 0, 0, 0, 192
.byte 0, 0, 0, 0, 0, 0, 0, 64
.byte 0, 0, 0, 0, 0, 0, 0, 192
.byte 0, 0, 0, 0, 0, 0, 0, 64
.text
vmovdqa LINENO_LOW, LINENO_LOW_INIT

// %ecx is the number of bytes in 15 lines. That means that the number
// of 200-line units in 512 bytes is 38.4/%ecx, i.e. 384/(%ecx*10).
// Multiply by 2**64 (i.e. 384*2**64/(%ecx*10)) to get LINENO_LOW_INCR.
lea %ecx, [%rcx + %rcx * 4]
add %ecx, %ecx
mov %edx, 384
xor %eax, %eax
div %rcx  // 128-bit divide, %rax = %rdx:%rax / %rcx
vpxor LINENO_LOW_INCR, LINENO_LOW_INCR, LINENO_LOW_INCR
vpinsrq LINENO_LOW_INCRx, LINENO_LOW_INCRx, %rax, 0
vpermq LINENO_LOW_INCR, LINENO_LOW_INCR, 0
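The increment calculation can be modelled in Python (illustrative only): 512 bytes of output covers 512*15/bytes_per_15 lines, i.e. 38.4/bytes_per_15 units of 200 lines, and the integer division below corresponds to the 128-bit `div`:

```python
def lineno_low_incr(bytes_per_15):
    # 512 bytes covers 512*15/bytes_per_15 lines; in units of 200 lines
    # that's 38.4/bytes_per_15 = 384/(10*bytes_per_15), and the fixpoint
    # scale is 2**64 per 200 lines
    return (384 << 64) // (10 * bytes_per_15)

# at width 8 a group of 15 lines is 8*8 + 47 = 111 bytes
incr = lineno_low_incr(111)

# reading the increment back as lines (200 lines per 2**64) agrees with
# the number of lines in 512 bytes, up to the rounding of the division
assert abs(incr * 200 / 2**64 - 512 * 15 / 111) < 1e-9
```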

// LINENO_MID is almost stored in high-decimal, as four eight-digit
// numbers. However, the number represented is the closest line number
// that's 50 mod 100, stored as the two closest multiples of 100 (e.g.
// if the true line number is 235, it's approximated as 250 and then
// stored using the representations for 200 and 300), which is why
// LINENO_LOW needs the offsets of 50 and -50 to easily do a carry. A
// ymm vector holds four 64-bit numbers, two of which hold the value
// that's 0 mod 200, two which hold the value that's 100 mod 200. So
// carries on it are handled using a vector of mostly 246s, with 247s
// in the two locations which are always odd.
.section .rodata
lineno_mid_base:
.byte 246, 246, 246, 246, 246, 246, 246, 246
.byte 247, 246, 246, 246, 246, 246, 246, 246
.byte 246, 246, 246, 246, 246, 246, 246, 246
.byte 247, 246, 246, 246, 246, 246, 246, 246
.text
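The carry trick can be modelled in Python (illustrative, one 64-bit lane only): each digit d is stored as the byte 246+d, so a plain 64-bit add lets byte overflow propagate the decimal carry, and a bytewise max against the base vector turns wrapped bytes back into high-decimal digits. The 247 in the hundreds position of the odd half is what makes 9+2 correctly produce a 1 rather than a 0:

```python
EVEN_BASE = bytes([246] * 8)
ODD_BASE  = bytes([247] + [246] * 7)  # hundreds digit here is always odd

def encode(n):
    # store n//100 as eight high-decimal digits, hundreds digit first
    return bytes(246 + (n // 100 // 10**i) % 10 for i in range(8))

def decode(lane):
    return sum((x - 246) * 10**i for i, x in enumerate(lane)) * 100

def add_carry(lane, base, carry=2):
    # vpaddq: 64-bit add (a byte overflow carries into the next digit)
    v = (int.from_bytes(lane, 'little') + carry) & (2**64 - 1)
    # vpmaxub: clamp wrapped bytes back up into the high-decimal range
    return bytes(max(x, m) for x, m in zip(v.to_bytes(8, 'little'), base))

assert decode(add_carry(encode(800), EVEN_BASE)) == 1000      # 8+2 carries
assert decode(add_carry(encode(900), ODD_BASE)) == 1100       # 9+2 -> 1, carry
assert decode(add_carry(encode(99800), EVEN_BASE)) == 100000  # 9s ripple
```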

// This code is some fairly complex vector manipulation to initialise
// LINENO_MID to a power of 10 (handling the case where LINENO_WIDTH
// is so high that the hundreds to billions digits are all zeroes).
mov %edx, 1
mov %eax, 11
sub %eax, LINENO_WIDTHe
cmovbe %eax, %edx
shl %eax, 3
vpxor %xmm0, %xmm0, %xmm0
vpinsrq %xmm0, %xmm0, %rax, 0
vpermq %ymm0, %ymm0, 0
vpcmpeqb LINENO_MID, LINENO_MID, LINENO_MID
vpsrlq LINENO_MID, LINENO_MID, %xmm0
vpmaxub LINENO_MID, LINENO_MID_BASE, LINENO_MID
vpermq %ymm0, LINENO_MID_BASE, 0x55
vpsubb %ymm0, %ymm0, LINENO_MID_BASE
vpaddq LINENO_MID, LINENO_MID, %ymm0
vpmaxub LINENO_MID, LINENO_MID_BASE, LINENO_MID

// LINENO_TOP doesn't need to be initialized for new widths, because
// an overrun by 100 lines is possible, but by 10 billion lines isn't.
// The format consists of two 64-bit sections that hold high-decimal
// numbers (these are always the same as each other), and two that
// hold constants that are used by the bytecode generator.
.section .rodata
lineno_top_init:
.byte 198, 197, 196, 195, 194, 193, 192, 191
.byte 246, 246, 246, 246, 246, 246, 246, 246
.byte 190, 189, 188, 187, 186, 185, 184, 183
.byte 246, 246, 246, 246, 246, 246, 246, 246
.text

// When moving onto a new width, start at the start of the bytecode
// program.
mov BYTECODE_IP, BYTECODE_START


// Generating the bytecode program
//
// The bytecode format is very simple (in order to allow it to be
// interpreted in just a couple of machine instructions):
// - A negative byte represents a literal character (e.g. to produce
//   a literal 'F', you use the bytecode -'F', i.e. -70 = 0xba)
// - A byte 0..7 represents the hundreds..billions digit of the line
//   number respectively, and asserts that the hundreds digit of the
//   line number is even
// - A byte 8..15 represents the hundreds..billions digit of the line
//   number respectively, and asserts that the hundreds digit of the
//   line number is odd
//
// In other words, the bytecode program only ever needs to read from
// LINENO_MID; the information stored in LINENO_LOW and LINENO_TOP
// therefore has to be hardcoded into it. The bytecode program thus
// needs to cover 600 lines of output: 600 is the smallest number
// divisible by 100 (so the two low digits can be hardcoded), by 200
// (so the assertions about the hundreds digit come out right), and by
// 3 and 5 (so the Fizzes and Buzzes land in the right places).
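The 600 can be checked directly (Python, illustrative): everything the bytecode hardcodes repeats with period lcm(100, 200, 3, 5):

```python
import math

# the bytecode hardcodes the units/tens digits (period 100), the
# hundreds-digit parity assertions (period 200), and the Fizz/Buzz
# pattern (periods 3 and 5)
assert math.lcm(100, 200, 3, 5) == 600

def hardcoded_state(n):
    return (n % 100, (n // 100) % 2, n % 3 == 0, n % 5 == 0)

assert all(hardcoded_state(n) == hardcoded_state(n + 600)
           for n in range(1, 1201))
```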

generate_bytecode:

mov BYTECODE_GEN_PTR, BYTECODE_START

// FIZZ and BUZZ work just like in the second phase, except that they
// are now bytecode programs rather than ASCII.
mov FIZZ, 0xf6868697ba  // -"Fizz\n"
mov BUZZ, 0xf686868bbe  // -"Buzz\n"

// %ymm2 holds the bytecode for outputting the hundreds and more
// significant digits of a line number. The most significant digits of
// this can be obtained by converting LINENO_TOP from high-decimal to
// the corresponding bytecode, which is accomplished by subtracting
// from 198 (i.e. 256 - 10 - '0'). The constant parts of LINENO_TOP
// are 198 minus the bytecode for outputting the hundreds to billions
// digit of a number; this makes it possible for a single endian
// shuffle to deal with all 16 of the mid and high digits at once.
.section .rodata
bascii_offset:
.byte REP8(198), REP8(198), REP8(198), REP8(198)
.text
vpsubb %ymm2, BASCII_OFFSET, LINENO_TOP
vpshufb %ymm2, %ymm2, ENDIAN_SHUFFLE

#define GEN_FIZZ  mov [BYTECODE_GEN_PTR], FIZZ; add BYTECODE_GEN_PTR, 5
#define GEN_BUZZ  mov [BYTECODE_GEN_PTR], BUZZ; add BYTECODE_GEN_PTR, 5
#define GEN_FIZZBUZZ \
  mov [BYTECODE_GEN_PTR], FIZZe; \
  mov [BYTECODE_GEN_PTR + 4], BUZZ; add BYTECODE_GEN_PTR, 9
#define GEN_LINENO(units_digit) \
  vmovdqu [BYTECODE_GEN_PTR], %xmm2; \
  lea BYTECODE_GEN_PTR, [BYTECODE_GEN_PTR + LINENO_WIDTH + 1]; \
  mov [BYTECODE_GEN_PTR - 3], %al; \
  mov word ptr [BYTECODE_GEN_PTR - 2], 0xf6d0 - units_digit

// The bytecode generation loop is unrolled to depth 30, allowing the
// units digits to be hardcoded. The tens digit is stored in %al, and
// incremented every ten lines of output. The parity of the hundreds
// digit is stored in %ymm2: one half predicts the hundreds digit to
// be even, the other to be odd, and the halves are swapped every time
// the tens digit carries (ensuring the predictions are correct).
mov %eax, 0xd0
jmp 2f
inc_tens_digit:
cmp %al, 0xc7
je 1f  // jumps every 10th execution, therefore predicts perfectly
dec %eax
ret
1: mov %eax, 0xd0
vpermq %ymm2, %ymm2, 0x4e
ret

2: mov %ecx, 20
thirty_bytecode_lines:
GEN_BUZZ
GEN_LINENO(1)
GEN_FIZZ
GEN_LINENO(3)
GEN_LINENO(4)
GEN_FIZZBUZZ
GEN_LINENO(6)
GEN_LINENO(7)
GEN_FIZZ
GEN_LINENO(9)
call inc_tens_digit
GEN_BUZZ
GEN_FIZZ
GEN_LINENO(2)
GEN_LINENO(3)
GEN_FIZZ
GEN_BUZZ
GEN_LINENO(6)
GEN_FIZZ
GEN_LINENO(8)
GEN_LINENO(9)
call inc_tens_digit
GEN_FIZZBUZZ
GEN_LINENO(1)
GEN_LINENO(2)
GEN_FIZZ
GEN_LINENO(4)
GEN_BUZZ
GEN_FIZZ
GEN_LINENO(7)
GEN_LINENO(8)
GEN_FIZZ
call inc_tens_digit
dec %ecx
jnz thirty_bytecode_lines

generate_bytecode_overrun_area:

// Duplicate the first 512 bytes of the bytecode program at the end,
// so that there's no need to check to see whether BYTECODE_IP needs
// to be looped back to the start of the program any more than once
// per 512 bytes
mov %rax, BYTECODE_START
#define COPY_64_BYTECODE_BYTES(offset) \
  vmovdqa %ymm0, [%rax + offset]; \
  vmovdqa %ymm3, [%rax + (offset + 32)]; \
  vmovdqu [BYTECODE_GEN_PTR + offset], %ymm0; \
  vmovdqu [BYTECODE_GEN_PTR + (offset + 32)], %ymm3
COPY_64_BYTECODE_BYTES(0)
COPY_64_BYTECODE_BYTES(64)
COPY_64_BYTECODE_BYTES(128)
COPY_64_BYTECODE_BYTES(192)
COPY_64_BYTECODE_BYTES(256)
COPY_64_BYTECODE_BYTES(320)
COPY_64_BYTECODE_BYTES(384)
COPY_64_BYTECODE_BYTES(448)


// Preparing for the main loop
//
// Work out how long the main loop is going to iterate for.
// OUTPUT_LIMIT holds the address just beyond the end of the output
// that the main loop should produce. The aim here is to produce
// exactly one pipeful of data if possible, but to stop earlier if
// there's a change in digit width (because any output beyond that
// point will be useless: the bytecode will give it the wrong number
// of digits).
calculate_main_loop_iterations:

// Extract the pipe size from BYTECODE_START, in 32-byte units.
// During this calculation, OUTPUT_LIMIT holds the amount of output
// produced, rather than an address like normal.
mov OUTPUT_LIMIT, BYTECODE_START
lea %rdx, [%rip + iovec_base]
sub OUTPUT_LIMIT, %rdx
shr OUTPUT_LIMIT, 5

// Reduce the output limit to the end of this width, if it would be
// higher than that.
cmp OUTPUT_LIMIT, YMMS_AT_WIDTH
cmovae OUTPUT_LIMIT, YMMS_AT_WIDTH

// If there's already some output in the buffer, reduce the amount
// of additional output produced accordingly (whilst ensuring that
// a multiple of 512 bytes of output is produced).
//
// This would be buggy if the YMMS_AT_WIDTH limit were hit at the
// same time, but that never occurs as it would require two width
// changes within one pipeful of each other, and 9000000 lines of
// FizzBuzz is much more than a pipeful in size.
mov %rax, OUTPUT_PTR
and %eax, ((2 << 20) - 1) & -512
shr %eax, 5
sub OUTPUT_LIMIT, %rax

// The amount of output to produce is available now, and won't be
// later, so subtract it from the amount of output that needs to
// be produced now.
sub YMMS_AT_WIDTH, OUTPUT_LIMIT

// Return OUTPUT_LIMIT back to being a pointer, not an amount.
shl OUTPUT_LIMIT, 5
add OUTPUT_LIMIT, OUTPUT_PTR

prepare_main_loop_invariants:

// To save one instruction in the bytecode interpreter (which is very
// valuable, as it runs every second CPU cycle), LINENO_MID_TEMP is
// used to store a reformatted version of LINENO_MID, in which each
// byte is translated from high-decimal to ASCII, and the bytecode
// command that would access that byte is added to the result (e.g.
// the thousands digit for the hundreds-digits-odd version has 10
// added to convert from high-decimal to a pure number, '0' added to
// convert to ASCII, then 9 added because that's the bytecode command
// to access the thousands digit when the hundreds digit is odd, so
// the amount added is 10 + '0' + 9 = 67).
//
// LINENO_MID_TEMP is updated within the main loop, immediately after
// updating LINENO_MID, but because the bytecode interpreter reads
// from it it needs a valid value at the start of the loop.
.section .rodata
biascii_offset:
.byte 58, 59, 60, 61, 62, 63, 64, 65
.byte 66, 67, 68, 69, 70, 71, 72, 73
.byte 58, 59, 60, 61, 62, 63, 64, 65
.byte 66, 67, 68, 69, 70, 71, 72, 73
.text
vpaddb LINENO_MID_TEMP, BIASCII_OFFSET, LINENO_MID
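In Python terms (an illustrative model): the table entry for bytecode command i is 10 + ord('0') + i, and adding it to a high-decimal digit byte 246 + d wraps mod 256 to give the ASCII digit plus the command byte, which is exactly what the interpreter's subtract step needs:

```python
BIASCII = [10 + ord('0') + cmd for cmd in range(16)]  # 58..73

def mid_temp_byte(digit, cmd):
    # high-decimal digit byte plus table entry, as vpaddb computes it;
    # the 246 bias plus 10 wraps to 0 mod 256
    return ((246 + digit) + BIASCII[cmd]) & 0xFF

# the interpreter later subtracts the command byte, leaving plain ASCII
for digit in range(10):
    for cmd in range(16):
        assert mid_temp_byte(digit, cmd) - cmd == ord('0') + digit

# e.g. the thousands digit, hundreds-digit-odd (command 9): 67 is added
assert BIASCII[9] == 67
```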

// To save an instruction, precalculate minus the length of the
// bytecode. (Although the value of this is determined entirely by
// LINENO_WIDTH, the register it's stored in gets clobbered by
// system calls and thus needs to be recalculated each time.)
mov BYTECODE_NEG_LEN, BYTECODE_START
sub BYTECODE_NEG_LEN, BYTECODE_END


// The main loop

// The bytecode interpreter consists of four instructions:
// 1. Load the bytecode from memory into %ymm2;
// 2. Use it as a shuffle mask to shuffle LINENO_MID_TEMP;
// 3. Subtract the bytecode from the shuffle result;
// 4. Output the result of the subtraction.
//
// To see why this works, consider two cases. If the bytecode wants to
// output a literal character, then the shuffle will produce 0 for
// that byte (in AVX2, a shuffle with a negative index produces an
// output of 0), and subtracting the bytecode from 0 then produces the
// character (because the bytecode encoded minus the character). If
// the bytecode instead wants to output a digit, then the shuffle will
// fetch the relevant digit from LINENO_MID_TEMP (which is the desired
// ASCII character plus the bytecode instruction that produces it),
// and subtract the bytecode instruction to just produce the character
// on its own.
//
// This produces an exactly correct line number as long as the line
// number approximation is within 100 of the true value: it will be
// correct as long as the relevant part of LINENO_MID is correct, and
// the worst case is for LINENO_MID to be storing, say, 200 and 300
// (the representation of 250) when the true line number is 400. The
// value in LINENO_MID specifically can be up to 50 away from the
// value of the line number as recorded by LINENO_MID and LINENO_LOW
// together, so as long as the line number registers are within 100,
// LINENO_MID will be within 150 (which is what is required).
//
// This doesn't update the bytecode instruction pointer or the pointer
// into the output buffer; those are updated once every 512 bytes (and
// to "advance the instruction pointer" the rest of the time, the main
// loop is unrolled, using hardcoded offsets with the pointer updates
// baked in).
//
// The bytecode instruction pointer itself is read from %rdx, not
// BYTECODE_IP, so that mid-loop arithmetic on BYTECODE_IP won't cause
// the interpreter to break.
//
// It's important to note one potential performance issue with this
// code: the read of the bytecode from memory is not only potentially
// misaligned (hence the `vmovdqu`); it splits a cache line 3/8 of the
// time, incurring an L1 split-load penalty on each read where that
// occurs. I am not
// sure whether this actually reduces the program's performance in
// practice, or whether the split loads can be absorbed while waiting
// for writes to go through to the L2 cache. However, even if it does
// have a genuine performance cost, it seems like the least costly way
// to read the bytecode; structuring the bytecode to avoid split loads
// makes it take up substantially more memory, and the less cache that
// is used for the bytecode, the more that can be used for the output
// buffers. (In particular, increasing the bytecode to 2400 lines so
// that it's available at all four of the alignments required of it
// does not help, because it then becomes so large that the processor
// struggles to keep it in L1 cache - it only just fits, and there
// isn't any way for it to know which parts of the cache are meant to
// stay in L1 and which are meant to leave to L2, so there's a large
// slowdown when it guesses wrong.)
#define INTERPRET_BYTECODE(bc_offset, buf_offset) \
  vmovdqu %ymm2, [%rdx + bc_offset]; \
  vpshufb %ymm0, LINENO_MID_TEMP, %ymm2; \
  vpsubb %ymm0, %ymm0, %ymm2; \
  vmovdqa [OUTPUT_PTR + buf_offset], %ymm0
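These four instructions can be modelled in Python (illustrative; `vpshufb` produces 0 for mask bytes with the top bit set, which is what makes the literal-character case work):

```python
def vpshufb(src, mask):
    # per 16-byte lane: top bit set -> 0, else select src[m & 15]
    return bytes(0 if m & 0x80 else src[(i // 16) * 16 + (m & 0x0F)]
                 for i, m in enumerate(mask))

def interpret(bytecode, mid_temp):
    shuffled = vpshufb(mid_temp, bytecode)  # shuffle LINENO_MID_TEMP
    return bytes((s - b) & 0xFF             # subtract the bytecode
                 for s, b in zip(shuffled, bytecode))

def lit(s):  # literal characters are stored negated
    return [(-ord(c)) & 0xFF for c in s]

# LINENO_MID_TEMP for lines 200..399: even half holds 200, odd half
# holds 300; each byte is the ASCII digit plus the command that reads it
even, odd = [2, 0, 0, 0, 0, 0, 0, 0], [3, 0, 0, 0, 0, 0, 0, 0]
mid_temp = bytes([ord('0') + even[i] + i for i in range(8)] +
                 [ord('0') + odd[i] + 8 + i for i in range(8)])

# bytecode for "Fizz\n301\n302\n" plus padding; command 8 = hundreds
# digit, asserting that the hundreds digit is odd
bytecode = bytes(lit("Fizz\n") + [8] + lit("01\n")
                 + [8] + lit("02\n") + lit("Fiz"))
assert interpret(bytecode, mid_temp) == b"Fizz\n301\n302\nFiz"
```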

// The main loop itself consists of sixteen uses of the bytecode
// interpreter, interleaved (to give the reorder buffer maximum
// flexibility) with all the other logic needed in the main loop.
// (Most modern processors can handle 4-6 instructions per clock cycle
// as long as they don't step on each others' toes; thus this loop's
// performance will be limited by the throughput of the L2 cache, with
// all the other work (bytecode interpretation, instruction decoding,
// miscellaneous other instructions, etc.) fitting into the gaps while
// the processor is waiting for the L2 cache to do its work.)

.p2align 5
main_loop:
// %rdx caches BYTECODE_IP's value at the start of the loop
mov %rdx, BYTECODE_IP
INTERPRET_BYTECODE(0, 0)

// %ymm1 caches LINENO_LOW's value at the start of the loop
vmovdqa %ymm1, LINENO_LOW
INTERPRET_BYTECODE(32, 32)

// Add LINENO_LOW_INCR to LINENO_LOW, checking for carry; it carried
// if the sign bit changed from 0 to 1. (vpandn is unintuitive; this
// is ~%ymm1 & LINENO_LOW, not %ymm1 & ~LINENO_LOW like the name
// suggests.)
vpaddq LINENO_LOW, LINENO_LOW_INCR, LINENO_LOW
INTERPRET_BYTECODE(64, 64)

vpandn %ymm3, %ymm1, LINENO_LOW
INTERPRET_BYTECODE(96, 96)

vpsrlq %ymm3, %ymm3, 63
INTERPRET_BYTECODE(128, 128)

// Add the carry to LINENO_MID (doubling it; LINENO_MID counts in
// units of 100 but a LINENO_LOW carry means 200).
vpaddb %ymm3, %ymm3, %ymm3
INTERPRET_BYTECODE(160, 160)

vpaddq LINENO_MID, LINENO_MID, %ymm3
INTERPRET_BYTECODE(192, 192)

vpmaxub LINENO_MID, LINENO_MID_BASE, LINENO_MID
INTERPRET_BYTECODE(224, 224)

// Update LINENO_MID_TEMP with the new value from LINENO_MID; this is
// the point at which the new value takes effect. This is done at the
// exact midpoint of the loop, in order to reduce the errors from
// updating once every 512 bytes as far as possible.
vpaddb LINENO_MID_TEMP, BIASCII_OFFSET, LINENO_MID
INTERPRET_BYTECODE(256, 256)

// Update the output and bytecode instruction pointers. The change to
// the output pointer kicks in immediately, but is cancelled out via
// the use of a negative offset until the end of the loop.
add OUTPUT_PTR, 512
INTERPRET_BYTECODE(288, -224)

add BYTECODE_IP, 512
INTERPRET_BYTECODE(320, -192)

// The change to the bytecode instruction pointer doesn't kick in
// immediately, because it might need to wrap back to the start (this
// can be done by adding BYTECODE_NEG_LEN to it); this is why the
// interpreter has a cached copy of it in %rdx.
lea %rax, [BYTECODE_IP + BYTECODE_NEG_LEN]
INTERPRET_BYTECODE(352, -160)

INTERPRET_BYTECODE(384, -128)
// Some modern processors can optimise `cmp` better if it appears
// immediately before the command that uses the comparison result, so
// a couple of commands have been moved slightly to put the `cmp` next
// to the use of its result. With modern out-of-order processors,
// there is only a marginal advantage to manually interleaving the
// instructions being used, and the `cmp` advantage outweighs that.
cmp BYTECODE_IP, BYTECODE_END

cmovae BYTECODE_IP, %rax
INTERPRET_BYTECODE(416, -96)

INTERPRET_BYTECODE(448, -64)

INTERPRET_BYTECODE(480, -32)
cmp OUTPUT_PTR, OUTPUT_LIMIT
jb main_loop

after_main_loop:
// There are two reasons the main loop might terminate: either there's
// a pipeful of output, or the line number has increased in width
// (forcing the generation of new bytecode to put more digits in the
// numbers being printed). In the latter case, a) the output may have
// overrun slightly, and OUTPUT_PTR needs to be moved back to
// OUTPUT_LIMIT:
mov OUTPUT_PTR, OUTPUT_LIMIT
// and b) there may be less than a pipeful of output, in which case it
// wouldn't be safe to output it and the swap_buffers call needs to be
// skipped. Calculate the pipe size into %rax, the amount of output
// into %rdx (swap_buffers needs it there anyway), and compare.
lea %rax, [%rip + iovec_base]
sub %rax, BYTECODE_START
neg %eax
mov %rdx, OUTPUT_PTR
and %edx, (2 << 20) - 1
cmp %edx, %eax
jb 1f
call swap_buffers

// If all the lines at this width have been exhausted, move to the
// next width.
1: test YMMS_AT_WIDTH, YMMS_AT_WIDTH
jnz check_lineno_top_carry

cmp LINENO_WIDTHe, 18  // third phase handles at most 18 digits
je fourth_phase

inc LINENO_WIDTHe
vpcmpeqb %ymm0, %ymm0, %ymm0
vpsubb ENDIAN_SHUFFLE, ENDIAN_SHUFFLE, %ymm0

lea GROUPS_OF_15, [GROUPS_OF_15 + GROUPS_OF_15 * 4]
add GROUPS_OF_15, GROUPS_OF_15

add BYTECODE_END, 320

jmp third_phase_per_width_init

// So far, the code has kept LINENO_MID and LINENO_LOW updated, but
// not LINENO_TOP. Because 10 billion lines of FizzBuzz don't normally
// have a length that's divisible by 512 (and indeed, vary in size a
// little because 10 billion isn't divisible by 15), it's possible for
// the 10-billions and higher digits to need to change in the middle
// of a main loop iteration - indeed, even in the middle of a single
// CPU instruction!
//
// It turns out that when discussing the line number registers above,
// I lied a little about the format. The bottom seven bytes of
// LINENO_MID do indeed represent the hundreds to hundred millions
// digits. However, the eighth changes in meaning over the course of
// the program. It does indeed represent the billions digit most of
// the time; but when the line number is getting close to a multiple
// of 10 billion, the billions and hundred-millions digits will always
// be the same as each other (either both 9s or both 0s). When this
// happens, the format changes: the hundred-millions digit of
// LINENO_MID represents *both* the hundred-millions and billions
// digits of the line number, and the top byte then represents the
// ten-billions digit. Because incrementing a number causes a row of
// consecutive 9s to either stay untouched, or all roll over to 0s at
// once, this effectively lets us do maths on more than 8 digits,
// meaning that the normal arithmetic code within the main loop can
// handle the ten-billions digit in addition to the digits below.
//
// Of course, the number printing code also needs to handle the new
// representation, but the number printing is done by a bytecode
// program, which can be made to output some of the digits being
// printed multiple times by repeating "print digit from LINENO_MID"
// commands within it. Those commands are generated from COUNTER_TOP
// anyway, so the program just changes the constant portion of
// COUNTER_TOP (and moves print-digit commands into the top half) in
// order to produce the appropriate bytecode changes.
//
// A similar method is used to handle carries in the hundred-billions,
// trillions, etc. digits.
//
// Incidentally, did you notice the apparent off-by-one in the
// initialisation of LINENO_MID within third_phase_per_width_init? It
// causes the "billions" digit to be initialised to 1 (not 0) when the
// line number width is 11 or higher. That's because the alternate
// representation will be in use during a line number width change (as
// higher powers of 10 are close to multiples of 10 billion), so the
// digit that's represented by that byte of LINENO_MID genuinely is a
// 1 rather than a 0.
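The digit fact behind the alternate representation can be checked in Python (illustrative): close to a multiple of 10 billion, the hundred-millions and billions digits are always equal, so a single stored byte can safely stand for both:

```python
def digit(n, i):
    return (n // 10**i) % 10

# within 10**8 lines of any multiple of 10**10, the 10**8 and 10**9
# digits agree (both 9 just below it, both 0 just above it), because an
# increment rolls a run of trailing 9s over all at once
for base in (10**10, 3 * 10**10):
    for n in range(base - 10**8, base + 10**8, 9973):
        assert digit(n, 8) == digit(n, 9)
```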
check_lineno_top_carry:

// The condition to change line number format is:
// a) The line number is in normal format, and the hundred-millions
//    and billions digits are both 9; or
// b) The line number is in alternate format, and the hundred-millions
//    digit is 0.
// To avoid branchy code in the common case (when no format change is
// needed), REGEN_TRIGGER is used to store the specific values of the
// hundred-millions and billions digits that mean a change is needed,
// formatted as two repeats of billions, hundred-millions, 9, 9 in
// high-decimal (thus, when using normal format, REGEN_TRIGGER is
// high-decimal 99999999, i.e. -1 when interpreted as binary). The 9s
// are because vpshufd doesn't have very good resolution: the millions
// and ten-millions digits get read too, but can simply just be masked
// out. The two repeats are to ensure that both halves of LINENO_MID
// (the even-hundreds-digit and odd-hundreds-digit halves) have the
// correct value while changing (changing the format while half the
// register still ended ...98999999 would produce incorrect output).
vpshufd %xmm0, LINENO_MIDx, 0xED
vpextrq %rax, %xmm0, 0
mov %rdx, 0x0000ffff0000ffff
or %rax, %rdx
cmp %rax, REGEN_TRIGGER
jne calculate_main_loop_iterations

cmp REGEN_TRIGGER, -1
jne switch_to_normal_representation


switch_to_alternate_representation:
// Count the number of 9s at the end of LINENO_TOP. To fix an edge
// case, the top bit of LINENO_TOP is interpreted as a 0, preventing
// a 9 being recognised there (this causes 10**18-1 to increment to
// 10**17 rather than 10**18, but the program immediately exits
// before this can become a problem).
vpextrq %rdx, LINENO_TOPx, 1
mov SPILL, %rdx
shl %rdx, 1
shr %rdx, 1
not %rdx
bsf %rcx, %rdx
and %rcx, -8

// Change the format of LINENO_TOP so that the digit above the
// consecutive 9s becomes a reference to the top byte of LINENO_MID,
// and the 9s themselves references to the hundred-millions digit.
// This is done via a lookup table that specifies how to move the
// bytes around.
.section .rodata
alternate_representation_lookup_table:
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 7, 9, 10, 11, 12, 13, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 7, 9, 10, 11, 12, 13, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 7, 10, 11, 12, 13, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 7, 10, 11, 12, 13, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 7, 11, 12, 13, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 7, 11, 12, 13, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 7, 12, 13, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 7, 12, 13, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 7, 13, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 7, 13, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 7, 14, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 7, 14, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 6, 7, 15
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 6, 7, 15

.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 6, 6, 7
.byte 0, 1, 2, 3, 4, 5, 6, 6
.byte 6, 6, 6, 6, 6, 6, 6, 7
.text

lea %rax, [%rip + alternate_representation_lookup_table]
vpshufb LINENO_TOP, LINENO_TOP, [%rax + 4 * %rcx]

// The top byte of LINENO_MID also needs the appropriate digit of
// LINENO_TOP placed there.
mov %rdx, SPILL
shr %rdx, %cl
vpinsrb LINENO_MIDx, LINENO_MIDx, %edx, 7
vpinsrb LINENO_MIDx, LINENO_MIDx, %edx, 15
vpermq LINENO_MID, LINENO_MID, 0x44

// Finally, REGEN_TRIGGER needs to store the pattern of digits that
// will prompt a shift back to the normal representation (the hundred-
// millions digit must be 0, and the value of the billions digit will
// be predictable).
inc %edx
shl %edx, 24
or %edx, 0xF6FFFF
mov REGEN_TRIGGERe, %edx
shl %rdx, 32
or REGEN_TRIGGER, %rdx
jmp generate_bytecode


switch_to_normal_representation:
// Switching back is fairly easy: LINENO_TOP can almost be converted
// back into its usual format by running the bytecode program stored
// there to remove any unusual references into LINENO_MID, then
// restoring the usual references manually. Running the program will
// unfortunately convert high-decimal to ASCII (or in this case zeroes
// because there's no need to do the subtraction), but that can be
// worked around by taking the bytewise maximum of the converted and
// original LINENO_TOP values (high-decimal is higher than bytecode
// references and much higher than zero).
vpsubb %ymm2, BASCII_OFFSET, LINENO_TOP
vpshufb %ymm0, LINENO_MID, %ymm2
vpmaxub LINENO_TOP, LINENO_TOP, %ymm0

// Manually fix the constant parts of lineno_top to contain their
// usual constant values
.section .rodata
lineno_top_max:
.byte 198, 197, 196, 195, 194, 193, 192, 191
.byte 255, 255, 255, 255, 255, 255, 255, 255
.byte 190, 189, 188, 187, 186, 185, 184, 183
.byte 255, 255, 255, 255, 255, 255, 255, 255
.text
vpminub LINENO_TOP, LINENO_TOP_MAX, LINENO_TOP

// The billions digit of LINENO_MID needs to be set back to 0 (which
// is its true value at this point: the same as the hundred-millions
// digit, which is also 0).
vpsllq LINENO_MID, LINENO_MID, 8
vpsrlq LINENO_MID, LINENO_MID, 8
vpmaxub LINENO_MID, LINENO_MID_BASE, LINENO_MID

mov REGEN_TRIGGER, -1

jmp generate_bytecode


///// Fourth phase
//
// Ending at 999999999999999999 lines would be a little unsatisfying,
// so here's a routine to write the quintillionth line and exit.
//
// It's a "Buzz", which we can steal from the first phase's constant.

fourth_phase:

mov ARG1e, 1
lea ARG2, [%rip + fizzbuzz_intro + 11]
mov ARG3, 5
mov SYSCALL_NUMBER, __NR_write
syscall
call exit_on_error
xor ARG1e, ARG1e
jmp exit


///// Error handling code
//
// This doesn't run in a normal execution of the program, and isn't
// particularly optimised; I didn't comment it much because it isn't
// very interesting and also is fairly self-explanatory.

write_stderr:
mov ARG1e, 2
mov SYSCALL_NUMBER, __NR_write
syscall
ret

inefficiently_write_as_hex:
push %rax
push %rcx
shr %rax, %cl
and %rax, 0xF
.section .rodata
hexdigits: .ascii "0123456789ABCDEF"
.text
lea %rcx, [%rip + hexdigits]
movzx %rax, byte ptr [%rcx + %rax]
mov [%rip + error_write_buffer], %al
lea ARG2, [%rip + error_write_buffer]
mov ARG3e, 1
call write_stderr
pop %rcx
pop %rax
sub %ecx, 4
jns inefficiently_write_as_hex
ret

exit_on_error:
test SYSCALL_RETURN, SYSCALL_RETURN
js 1f
ret

.section .rodata
error_message_part_1: .ascii "Encountered OS error 0x"
error_message_part_2: .ascii " at RIP 0x"
error_message_part_3: .ascii ", exiting program.\n"
.text

1: push SYSCALL_RETURN
lea ARG2, [%rip + error_message_part_1]
mov ARG3e, 23
call write_stderr
pop SYSCALL_RETURN
neg SYSCALL_RETURN
mov %rcx, 8
call inefficiently_write_as_hex
lea ARG2, [%rip + error_message_part_2]
mov ARG3e, 10
call write_stderr
pop %rax  // find the caller's %rip from the stack
sub %rax, 5  // `call exit_on_error` compiles to 5 bytes
mov %rcx, 60
call inefficiently_write_as_hex
lea ARG2, [%rip + error_message_part_3]
mov ARG3e, 19
call write_stderr
mov ARG1e, 74
// fall through

exit:
mov SYSCALL_NUMBER, __NR_exit_group
syscall
ud2

.section .rodata
cpuid_error_message:
.ascii "Error: your CPUID command does not support command "
.ascii "0x80000006 (AMD-style L2 cache information).\n"
.text
bad_cpuid_error:
lea ARG2, [%rip + cpuid_error_message]
mov ARG3e, 96
call write_stderr
mov ARG1e, 59
jmp exit

.section .rodata
pipe_error_message:
.ascii "This program can only output to a pipe "
.ascii "(try piping into `cat`?)\n"
.text
pipe_error:
lea ARG2, [%rip + pipe_error_message]
mov ARG3e, 64
call write_stderr
mov ARG1e, 73
jmp exit

.section .rodata
pipe_perm_error_message_part_1:
.ascii "Cannot allocate a sufficiently large kernel buffer.\n"
.ascii "Try setting /proc/sys/fs/pipe-max-size to 0x"
pipe_perm_error_message_part_2: .ascii ".\n"
.text
pipe_perm_error:
lea ARG2, [%rip + pipe_perm_error_message_part_1]
mov ARG3e, 96
call write_stderr
mov %rax, PIPE_SIZE
mov %ecx, 28
call inefficiently_write_as_hex
lea ARG2, [%rip + pipe_perm_error_message_part_2]
mov ARG3e, 2
call write_stderr
mov ARG1e, 77
jmp exit

.section .rodata
pipe_size_error_message_part_1:
.ascii "Failed to resize the kernel pipe buffer.\n"
.ascii "Requested size: 0x"
pipe_size_error_message_part_2: .ascii "\nActual size: 0x"
pipe_size_error_message_part_3:
.ascii "\n(If the buffer is too large, this may cause errors;"
.ascii "\nthe program could run too far ahead and overwrite"
.ascii "\nmemory before it had been read from.)\n"
.text
pipe_size_mismatch_error:
push SYSCALL_RETURN
lea ARG2, [%rip + pipe_size_error_message_part_1]
mov ARG3e, 59
call write_stderr
mov %rax, PIPE_SIZE
mov %ecx, 28
call inefficiently_write_as_hex
lea ARG2, [%rip + pipe_size_error_message_part_2]
mov ARG3e, 16
call write_stderr
pop %rax
mov %ecx, 28
call inefficiently_write_as_hex
lea ARG2, [%rip + pipe_size_error_message_part_3]
mov ARG3e, 141
call write_stderr
mov ARG1e, 73
jmp exit
\$\endgroup\$
  • 69
    \$\begingroup\$ Wow, I have no words. This is really cool. Obviously didn't get a chance to understand much or any of it yet, but running it I'm getting 41-45GiB/s. This is almost an order of magnitude faster than most of the other submissions. Amazing work. \$\endgroup\$ Commented Oct 26, 2021 at 10:28
  • 33
    \$\begingroup\$ Okay now that I actually bothered to read everything you wrote rather than hurrying up to try it out, I saw the taskset note. I'm getting 56GiB/s doing that! \$\endgroup\$ Commented Oct 26, 2021 at 10:38
  • 177
    \$\begingroup\$ @chx: I already have a master's thesis. This was harder. \$\endgroup\$ Commented Oct 29, 2021 at 1:17
  • 6
    \$\begingroup\$ @StevenLu all Apple M1 chips are ARM chips and thus have a different instruction set to x86-64 with AVX2 extensions used here. \$\endgroup\$
    – andrybak
    Commented Oct 29, 2021 at 9:24
  • 6
    \$\begingroup\$ I read the whole thing over two days and learned more than I did in some college classes. Truly beautiful. Thank you sincerely for putting such a ludicrous amount of effort and expertise into this. \$\endgroup\$
    – Personman
    Commented Nov 22, 2021 at 6:18
66
\$\begingroup\$

After much trial and error, with the goal of not resorting to Assembly while achieving the best single-threaded performance, this is my entry:

#include <unistd.h>

#define unlikely(e)      __builtin_expect((e), 0)
#define mcpy(d, s, n)    __builtin_memcpy((d), (s), (n))

#define CACHELINE 64
#define PGSIZE 4096
#define ALIGNED_BUF 65536
#define FIZZ "Fizz"
#define BUZZ "Buzz"
#define DELIM "\n"

typedef struct {
    unsigned char offset;
    char data[CACHELINE - sizeof(unsigned char)];
} counters_t;

static inline void os_write(int out, void *buf, unsigned int n)
{
    while (n)
    {
        ssize_t written = write(out, buf, n);

        if (written >= 0)
        {
            buf += written;
            n -= written;
        }
    }
}

int main(void)
{
    const int out = 1;

    __attribute__((aligned(CACHELINE))) static counters_t counter = {
        sizeof(counter.data) - 1, "00000000000000000000000000000000000000000000000000000000000000"
    };
    __attribute__((aligned(PGSIZE))) static char buf[ALIGNED_BUF + (sizeof(counter.data) * 15 * 3)] = { 0 };
    char *off = buf;

    for (;;)
    {
        // Write chunks of 30 counters until we reach `ALIGNED_BUF`
        while (off - buf < ALIGNED_BUF)
        {
            #define NN (sizeof(counter.data) - 2)

            // Hand-rolled counter copy, because with non-constant sizes the compiler
            // just calls memcpy, which is too much overhead.
            #define CTRCPY(i) do { \
                const char *src = end; \
                char *dst = off; \
                unsigned _n = n; \
                switch (_n & 3) { \
                case 3: *dst++ = *src++; \
                case 2: \
                    mcpy(dst, src, 2); \
                    dst += 2; src += 2; \
                    break; \
                case 1: *dst++ = *src++; \
                case 0: break; \
                } \
                for (_n &= ~3; _n; _n -= 4, dst += 4, src += 4) { \
                    mcpy(dst, src, 4); \
                } \
                mcpy(off + n, i DELIM, sizeof(i DELIM) - 1); \
                off += n + sizeof(i DELIM) - 1; \
            } while (0)

            // Write the first 10 counters of the group (need to separate the
            // first 10 counters from the rest of the chunk due to possible decimal
            // order increase at the end of this block)
            {
                const char *const end = counter.data + counter.offset;
                const unsigned int n = sizeof(counter.data) - counter.offset - 1;

                CTRCPY("1"); // 1
                CTRCPY("2"); // 2

                mcpy(off, FIZZ DELIM, sizeof(FIZZ DELIM) - 1); // Fizz (3)
                off += sizeof(FIZZ DELIM) - 1;

                CTRCPY("4"); // 4

                mcpy(off, BUZZ DELIM FIZZ DELIM, sizeof(BUZZ DELIM FIZZ DELIM) - 1); // Buzz (5) Fizz (6)
                off += sizeof(BUZZ DELIM FIZZ DELIM) - 1;

                CTRCPY("7"); // 7
                CTRCPY("8"); // 8

                mcpy(off, FIZZ DELIM BUZZ DELIM, sizeof(FIZZ DELIM BUZZ DELIM) - 1); // Fizz (9) Buzz (10)
                off += sizeof(FIZZ DELIM BUZZ DELIM) - 1;

                // Carry handling on MOD 10
                for (unsigned d = NN; ; --d)
                {
                    if (counter.data[d] != '9')
                    {
                        ++counter.data[d];
                        break;
                    }
                    counter.data[d] = '0';
                }

                // Decimal order increases only when `counter MOD 30 == 10`
                if (unlikely(counter.data[counter.offset - 1] != '0'))
                {
                    if (unlikely(counter.offset == 1))
                    {
                        goto end;
                    }

                    --counter.offset;
                }
            }

            // Write the chunk's remaining 20 counters
            {
                const char *const end = counter.data + counter.offset;
                const unsigned int n = sizeof(counter.data) - counter.offset - 1;

                CTRCPY("1"); // 11

                mcpy(off, FIZZ DELIM, sizeof(FIZZ DELIM) - 1); // Fizz (12)
                off += sizeof(FIZZ DELIM) - 1;

                CTRCPY("3"); // 13
                CTRCPY("4"); // 14

                mcpy(off, FIZZ BUZZ DELIM, sizeof(FIZZ BUZZ DELIM) - 1); // FizzBuzz (15)
                off += sizeof(FIZZ BUZZ DELIM) - 1;

                CTRCPY("6"); // 16
                CTRCPY("7"); // 17

                mcpy(off, FIZZ DELIM, sizeof(FIZZ DELIM) - 1); // Fizz (18)
                off += sizeof(FIZZ DELIM) - 1;

                CTRCPY("9"); // 19

                mcpy(off, BUZZ DELIM FIZZ DELIM, sizeof(BUZZ DELIM FIZZ DELIM) - 1); // Buzz (20) Fizz (21)
                off += sizeof(BUZZ DELIM FIZZ DELIM) - 1;

                // Carry handling on MOD 10
                for (unsigned d = NN; ; --d)
                {
                    if (counter.data[d] != '9')
                    {
                        ++counter.data[d];
                        break;
                    }
                    counter.data[d] = '0';
                }

                CTRCPY("2"); // 22
                CTRCPY("3"); // 23

                mcpy(off, FIZZ DELIM BUZZ DELIM, sizeof(FIZZ DELIM BUZZ DELIM) - 1); // Fizz (24) Buzz (25)
                off += sizeof(FIZZ DELIM BUZZ DELIM) - 1;

                CTRCPY("6"); // 26

                mcpy(off, FIZZ DELIM, sizeof(FIZZ DELIM) - 1); // Fizz (27)
                off += sizeof(FIZZ DELIM) - 1;

                CTRCPY("8"); // 28
                CTRCPY("9"); // 29

                mcpy(off, FIZZ BUZZ DELIM, sizeof(FIZZ BUZZ DELIM) - 1); // FizzBuzz (30)
                off += sizeof(FIZZ BUZZ DELIM) - 1;

                // Carry handling on MOD 10
                for (unsigned d = NN; ; --d)
                {
                    if (counter.data[d] != '9')
                    {
                        ++counter.data[d];
                        break;
                    }
                    counter.data[d] = '0';
                }
            }
        }

        os_write(out, buf, ALIGNED_BUF);
        mcpy(buf, buf + ALIGNED_BUF, (off - buf) % ALIGNED_BUF);
        off -= ALIGNED_BUF;
    }

end:
    os_write(out, buf, off - buf);

    return 0;
}

Compiled as clang -o fizz fizz.c -O3 -march=native (with clang 11.0.0 on my Ubuntu 20.10 installation, running kernel version 5.8.0-26.27-generic 5.8.14 on an Intel Core i7-8750H mobile CPU while plugged into the wall), this produces ~3.8GiB/s when run as ./fizz | pv > /dev/null (not very steady due to write blocking every once in a while, but there's nothing I can do about that when single-threaded, I guess).

EDIT: Optimised the carry handling a bit, and now I'm getting ~3.9GiB/s on my machine (same configuration as above).

\$\endgroup\$
5
  • 11
    \$\begingroup\$ Welcome to the site! Nice first post. \$\endgroup\$ Commented Nov 14, 2020 at 18:38
  • 1
    \$\begingroup\$ Interesting, GCC sees through your memcpy(dst,src,4) loop and decides to turn into one variable-sized call to libc memcpy. godbolt.org/z/14K5Pd. But clang does turn that nasty-looking memcpy(dst,src,4) repeat-loop into vmovups instructions that load/store 32 bytes at a time, not just 4! godbolt.org/z/jM3f8K. So basically I think it turns (part of?) your manual memcpy back into one big memcpy like gcc does, but then inlines it instead of calling the libc function. \$\endgroup\$ Commented Nov 16, 2020 at 20:23
  • 2
    \$\begingroup\$ You don't need a macro for __builtin_memcpy, just use memcpy. GCC and clang define memcpy as a builtin (same as __builtin_memcpy) by default, unless you use -fno-builtin-memcpy or -fno-builtin (which you'd use for writing unit-tests or benchmarks for a hand-written libc or kernel implementation for example, to make it actually call the function instead of inlining or optimizing away). \$\endgroup\$ Commented Nov 16, 2020 at 20:26
  • 1
    \$\begingroup\$ @PeterCordes you're right about __builtin_memcpy in this case, but as an embedded developer (where GCC/clang don't assume that memcpy can be inlined with constant size) it's an instinct for me to reach for __builtin_memcpy in order to ensure inlining. \$\endgroup\$
    – Isaac G.
    Commented Nov 16, 2020 at 20:37
  • 1
    \$\begingroup\$ Ok, so you're used to compiling with -fno-builtin for embedded systems. Oh right, that's implied by -ffreestanding, unfortunately even for the functions gcc requires the freestanding environment to provide (memcpy, memmove, memset and memcmp). That's unfortunate; you end up wanting a #define to the __builtin version for every ISO C standard function you do implement. \$\endgroup\$ Commented Nov 16, 2020 at 20:49
39
\$\begingroup\$

I was struggling to get more than 2.75GB/s on my rig but then I realised I wasn't compiling with -O3 which bumped me up to 6.75GB/s.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
char buf[416];
char out[65536 + 4096] = "1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\n";
int main(int argc, char **argv) {
  const int o[16] = { 4, 7, 2, 11, 2, 7, 12, 2, 12, 7, 2, 11, 2, 7, 12, 2 };
  char *t = out + 30;
  unsigned long long i = 1, j = 1;
  for (int l = 1; l < 20; l++) {
    int n = sprintf(buf, "Buzz\n%llu1\nFizz\n%llu3\n%llu4\nFizzBuzz\n%llu6\n%llu7\nFizz\n%llu9\nBuzz\nFizz\n%llu2\n%llu3\nFizz\nBuzz\n%llu6\nFizz\n%llu8\n%llu9\nFizzBuzz\n%llu1\n%llu2\nFizz\n%llu4\nBuzz\nFizz\n%llu7\n%llu8\nFizz\n", i, i, i, i, i, i, i + 1, i + 1, i + 1, i + 1, i + 1, i + 2, i + 2, i + 2, i + 2, i + 2);
    i *= 10;
    while (j < i) {
      memcpy(t, buf, n);
      t += n;
      if (t >= &out[65536]) {
        char *u = out;
        do {
          int w = write(1, u, &out[65536] - u);
          if (w > 0) u += w;
        } while (u < &out[65536]);
        memcpy(out, out + 65536, t - &out[65536]);
        t -= 65536;
      }
      char *q = buf;
      for (int k = 0; k < 16; k++) {
        char *p = q += o[k] + l;
        if (*p < '7') *p += 3;
        else {
          *p-- -= 7;
          while (*p == '9') *p-- = '0';
          ++*p;
        }
      }
      j += 3;
    }
  }
}
\$\endgroup\$
12
  • \$\begingroup\$ Nicely done! The printed buffer reuse is something that I intended to do as well, but didn't get around to due to the complexity of my code. Impressive work! \$\endgroup\$
    – Isaac G.
    Commented Nov 15, 2020 at 14:46
  • 1
    \$\begingroup\$ Very nice. I tried parallelizing your and @IsaacG.'s solutions, but unfortunately two threads is already enough to cause enough contention on my 3900X to get less overall performance out of it than out of the single-threaded versions! Two threads pinned to the right cores seem to get it up to 8 GB/s for just a second or two, probably just the CPU boosting frequencies for a moment. Afterwards there is no benefit. \$\endgroup\$
    – G. Sliepen
    Commented Nov 15, 2020 at 18:56
  • 1
    \$\begingroup\$ What CPU (model number and/or uarch and frequency it actually ran at), compiler version, and OS? Benchmark numbers are most useful when we know the test system. \$\endgroup\$ Commented Nov 16, 2020 at 18:41
  • 1
    \$\begingroup\$ @PeterCordes My rig? it's an AMD Ryzen 7 3700X 8-core @ 2.2-3.6GHz using GCC 9.3.1 on Fedora 31. \$\endgroup\$
    – Neil
    Commented Nov 16, 2020 at 18:50
  • 2
    \$\begingroup\$ Consider using -O3 -march=native to let it take advantage of CPU features like AVX (32-byte vectors) if it ever wants to inline a memcpy or something. (The -mtune=znver2 implied by -march on your machine may help, too.) With memcpy size being runtime variable, it will just call the library function. so -fno-plt should reduce overhead for each call. Also -fno-pie -no-pie may make global arrays slightly more efficient (mov-immediate instead of RIP-relative LEA to put the address in a register, or t-&out[65536] could be a sub-immediate or maybe even LEA.) \$\endgroup\$ Commented Nov 16, 2020 at 20:10
37
\$\begingroup\$

283 GB/s output on AMD Ryzen 9 7700X.

To build (tested with GCC 13):

g++ fizzbuzz.cc -march=native -o fizzbuzz -O3 -Wall -std=c++20 -fno-tree-vectorize -fno-exceptions

The build takes a few minutes to complete. Compiling with or without -fno-tree-vectorize may yield better runtime performance depending on the CPU.

To benchmark:

  1. Install pv (ensure you have 1.6.6, later versions have an issue which makes the throughput lower when specifying -B)
  2. Run
taskset -c 0-6 ./fizzbuzz | taskset -c 7 pv -B 2M > /dev/null

Requires Linux 2.6.17 or later.

Performance tuning

  1. The value of the kParallelism constant in fizzbuzz.cc should be set to available CPU cores or less.
  2. The program uses kParallelism threads. It's worth trying different CPU affinities to see what gives the best performance. The number of cores assigned by taskset should be equal to kParallelism.
  3. For maximum performance, turn off mitigations (it's recommended to reenable mitigations after benchmarking since they protect against CPU vulnerabilities).

/proc/sys/fs/pipe-max-size must be at least 14680064 (14 MiB), or alternatively the program must be run as root (sudo ...)
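If the current cap is lower, it can be raised system-wide like this (a sketch assuming a typical Linux setup; the knob is described in pipe(7), and the change does not persist across reboots):

```shell
# Inspect the current per-user pipe size cap, in bytes.
cat /proc/sys/fs/pipe-max-size

# Raise the cap to 14 MiB (14 * 1024 * 1024 = 14680064 bytes) so that the
# program's fcntl(F_SETPIPE_SZ, ...) call can succeed without root.
sudo sh -c 'echo 14680064 > /proc/sys/fs/pipe-max-size'
```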


The code

#include <array>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <iostream>
#include <optional>
#include <thread>
#include <unistd.h>
#include <sys/mman.h>

namespace {

// Constexpr helper for calculating 10^X since std::pow is not constexpr.
constexpr int64_t PowTen(int x) {
  int64_t result = 1;
  for (int i = 0; i < x; ++i) {
    result *= 10;
  }
  return result;
}

// We process each batch in parallel from |kParallelism| threads. This number
// should be set to the available CPU cores or less. Note that higher number
// doesn't necessarily mean better performance due to the synchronization
// overhead between threads.
constexpr int64_t kParallelism = 7;
// The last kSuffixDigits digits of each number line are untouched when
// iterating.
constexpr int64_t kSuffixDigits = 6;
// Increment the first right-most touched digit by this much in one step. Must
// be divisible by 3. The code currently only handles when this is single-digit.
constexpr int64_t kIncrementBy = 3;
// One batch contains maximum this many lines.
constexpr int64_t kMaxLinesInBatch = PowTen(kSuffixDigits) * kIncrementBy / 3;

constexpr int kFizzLength = 4;
constexpr int kBuzzLength = 4;
constexpr int kNewlineLength = 1;

// A barrier that busy waits until all other threads reach the barrier.
template <typename Completion>
class SpinningBarrier {
 public:
  // Constructs a spinning barrier with |count| participating threads and
  // completion callback |completion_cb|.
  // After all threads reach the barrier, the last thread executes the
  // completion callback. The other threads are blocked until the completion
  // callback returns.
  SpinningBarrier(int64_t count, Completion completion_cb) :
      count_(count), spaces_(count), generation_(0),
      completion_cb_(completion_cb) {}

  void Wait() {
      int64_t my_generation = generation_;
      if (!--spaces_) {
          spaces_ = count_;
          completion_cb_();
          ++generation_;
      } else {
          while(generation_ == my_generation);
      }
  }

 private:
  int64_t count_;
  std::atomic<int64_t> spaces_;
  std::atomic<int64_t> generation_;
  Completion completion_cb_;
};

// Owns the output buffers and maintains which buffer was used last.
class OutputHandler {
 static constexpr size_t kBufferSize = 14 * 1024 * 1024;

 public:
  OutputHandler() {
    for (int i = 0; i < 3; ++i) {
      buffers_[i].reset(static_cast<char*>(
        std::aligned_alloc(2 * 1024 * 1024, kBufferSize)));
      madvise(buffers_[i].get(), kBufferSize, MADV_HUGEPAGE);
    }
  }

  void Output(int buffer_id, size_t bytes) {
    // We use three buffers. We have to ensure that while a buffer (or its
    // part) is in the pipe, it won't get modified. There is no API to know
    // when a downstream process is finished reading some data from the pipe,
    // so we choose the size of the pipe smartly.
    // As long as the pipe cannot fit more than two full buffers, we can ensure
    // that after outputting buffer 0, 1, 2 in this order, the pipe no
    // longer contains data from buffer 0. However, if we make the pipe too
    // small, the program will be slower. The optimal pipe size is calculated by
    // TargetPipeSize. Since there is a minimum pipe size which we
    // cannot go below (4kb on Linux), this approach won't work when the
    // buffer size is too small. In these cases we fall back to write() which
    // copies the content into the pipe, therefore there is no risk of
    // overwriting memory that is still being read from the downstream process.
    // However, if in the subsequent call to Output(), a smaller size were
    // passed (and therefore the else branch were executed), the pipe could
    // still end up containing some data from the current iteration and the
    // entire data from the next iteration. We assume that Output() will be
    // invoked with monotonically increasing sizes (which is true in practice
    // but it'd be better not to depend on this assumption).
    SetPipeSize(TargetPipeSize(bytes));
    if (2 * bytes >= pipe_size_) {
      OutputWithVmSplice(buffers_[buffer_id].get(), bytes);
    } else {
      if (write(STDOUT_FILENO, buffers_[buffer_id].get(), bytes) < 0) {
        std::cerr << "write error: " << errno;
        std::abort();
      }
    }
  }

  char* GetBuffer(int buffer_id) {
    return buffers_[buffer_id].get();
  }

  // Returns the next buffer id that can be filled up and outputted.
  // Callers are responsible to actually output the buffer after requesting it
  // with this method.
  int NextBufferId() {
    buffer_id_ = (buffer_id_ + 1) % 3;
    return buffer_id_;
  }

  static constexpr int64_t BufferSize() {
    return kBufferSize;
  }

 private:
  // Calculates the optimal pipe size for outputting |out_bytes|.
  size_t TargetPipeSize(size_t out_bytes) const {
    // Pipe sizes must be powers of 2 and >= 4kb on Linux.
    // We want that the pipe is not bigger than twice the output (but still
    // maximize the pipe size), so we round |out_bytes| up to the nearest power
    // of two.
    return std::max(4ul * 1024, std::bit_ceil(out_bytes));
  }

  void OutputWithVmSplice(char* buffer, size_t bytes) const {
    iovec iov;
    iov.iov_base = buffer;
    iov.iov_len = bytes;
    while (true) {
      int64_t ret = vmsplice(STDOUT_FILENO, &iov, 1, SPLICE_F_NONBLOCK);
      if (ret >= 0) {
        iov.iov_len -= ret;
        iov.iov_base = reinterpret_cast<char*>(iov.iov_base) + ret;
        if (iov.iov_len == 0) {
          break;
        }
      } else {
        if (errno != EAGAIN) {
          std::cerr << "vmsplice error: " << errno;
          std::abort();
        }
      }
    }
  }

  void SetPipeSize(size_t size) {
    if (pipe_size_ == size) {
      return;
    }
    int new_pipe_size = fcntl(STDOUT_FILENO, F_SETPIPE_SZ, size);
    if (new_pipe_size < 0) {
      std::cerr << "Error while calling fcntl F_SETPIPE_SZ " << errno
      << "\nPerhaps you need to update /proc/sys/fs/pipe-max-size or "
         "run the program as sudo";
      std::abort();
    }

    pipe_size_ = new_pipe_size;
  }

  std::array<std::unique_ptr<char[], decltype([](char* x) {std::free(x);})>, 3>
    buffers_;
  int buffer_id_ = 0;
  size_t pipe_size_;
};

// Inserts the fizzbuzz line for line number |line| and a newline character
// into |out|.
// Returns the pointer pointing to the character after the newline.
char* InsertFizzBuzzLine(char* out, int64_t line) {
  if (line % 15 == 0) {
    std::memcpy(out, "FizzBuzz\n", 9);
    return out + 9;
  } else if (line % 3 == 0) {
    std::memcpy(out, "Fizz\n", 5);
    return out + 5;
  } else if (line % 5 == 0) {
    std::memcpy(out, "Buzz\n", 5);
    return out + 5;
  } else {
    // We support numbers up to 10^20.
    char* next = std::to_chars(out, out + 20, line).ptr;
    *next = '\n';
    return next + 1;
  }
}

// A run refers to all lines where the line numbers have |DIGITS| digits.
// Run<1>: [1,9]
// Run<2>: [10,99]
// ...
template<int DIGITS>
class Run {

  static_assert(DIGITS >= 1);

  static constexpr int FizzBuzzLineLength(int64_t number_mod_15) {
    if (number_mod_15 % 15 == 0) {
      return 9;
    } else if (number_mod_15 % 3 == 0) {
      return 5;
    } else if (number_mod_15 % 5 == 0) {
      return 5;
    } else {
      return DIGITS + 1;
    }
  }

  // Returns the size of one fifteener in bytes.
  static constexpr size_t FifteenerBytes() {
    size_t size = 0;
    for (int i = 0; i < 15; ++i) {
      size += FizzBuzzLineLength(i);
    }
    return size;
  }

  // Returns the number of lines in this run.
  static constexpr int64_t LinesInRun() {
    return PowTen(DIGITS) - PowTen(DIGITS - 1);
  }

  // The entire fizz-buzz output for this run takes this many bytes.
  static constexpr size_t RunBytes() {
    if constexpr(DIGITS == 1) {
      return 5 + 3 * kFizzLength + 1 * kBuzzLength + 9 * kNewlineLength;
    } else {
      return LinesInRun() / 15 * FifteenerBytes();
    }
  }

  // Returns the number of batches in this run.
  static constexpr int64_t BatchesInRun() {
    if constexpr (DIGITS > kSuffixDigits) {
      return PowTen(DIGITS - kSuffixDigits - 1) * 9;
    } else {
      return 1;
    }
  }

 public:
  // Outputs all lines for this run by using the buffers from |output_handler|.
  static void Execute(OutputHandler& output_handler) {
    Batch<0> batch0(&output_handler);
    Batch<1> batch1(&output_handler);
    Batch<2> batch2(&output_handler);
    // We fill up each batch with the initial values. This is a relatively slow
    // process so we only do it once per run. In subsequent iterations, we
    // only increment the numbers (see below) which is much faster.
    batch0.Init();
    batch0.Output();
    if constexpr (BatchesInRun() > 1) {
      batch1.Init();
      batch1.Output();
    }
    if constexpr (BatchesInRun() > 2) {
      batch2.Init();
      batch2.Output();
    }

    if constexpr (BatchesInRun() > 3) {
      int64_t prefix = PowTen(DIGITS - kSuffixDigits - 1);
      // We update the batch from |kParallelism| threads
      // We use a spinning barrier for synchronizing between the threads.
      // After all threads reach the barrier, the completion function is
      // executed and the output is written out. Then the next batch is
      // processed.
      SpinningBarrier barrier(kParallelism, [&] {
        switch (prefix % 3) {
          // In the beginning
          //   batch0 corresponds to prefix 10..00 ( ≡ 1 mod 3),
          //   batch1 corresponds to prefix 10..01 ( ≡ 2 mod 3),
          //   batch2 corresponds to prefix 10..02 ( ≡ 0 mod 3).
          // After all 3 batches are processed, the prefix is incremented by 3,
          // hence the mods don't change.
          case 0: batch2.Output(); break;
          case 1: batch0.Output(); break;
          case 2: batch1.Output(); break;
        }
        prefix++;
      });

      [&]<size_t... THREAD_ID>(std::index_sequence<THREAD_ID...>) {
        // Launch |kParallelism| number of threads. We could also use a thread
        // pool, but one run takes long enough that launching new threads is
        // negligible.
        (std::jthread([&] {
          for (int64_t batch = 3; batch < BatchesInRun();
               batch += 3) {
            // Each thread processes their corresponding chunk in the batch.
            Chunk<0, THREAD_ID>(batch0).IncrementNumbers(prefix);
            // At this point, all threads wait until every other thread reaches
            // the barrier, the last thread to finish will invoke batch.Output()
            // (see above at the definition of |barrier|).
            barrier.Wait();
            Chunk<1, THREAD_ID>(batch1).IncrementNumbers(prefix);
            barrier.Wait();
            Chunk<2, THREAD_ID>(batch2).IncrementNumbers(prefix);
            barrier.Wait();
          }
        }) , ...);
      }(std::make_index_sequence<kParallelism>());
    }
  }

  // A batch represents 10^|kSuffixDigits| lines of the output.
  // This is useful because the last |kSuffixDigits| digits don't need to be
  // updated. Furthermore, line numbers in one batch share the same prefix.
  // BATCH_ID ∈ [0, 1, 2]
  template<int BATCH_ID>
  class Batch {
    static_assert(BATCH_ID < 3);
    using PreviousBatch = Batch<BATCH_ID - 1>;

    public:
    Batch(OutputHandler* output_handler) : output_handler_(output_handler) {
      static_assert(OutputHandler::BufferSize() >= BytesInBatch());
    }

    // Initializes this batch by taking the next available buffer from
    // the output handler and filling it with the initial values.
    void Init() {
      buffer_id_ = output_handler_->NextBufferId();
      char* out = GetBuffer();
      int64_t start = PowTen(DIGITS - 1) + BATCH_ID * LinesInBatch();
      int64_t end = std::min(PowTen(DIGITS), start + LinesInBatch());
      for (int64_t line = start; line < end; ++line) {
        out = InsertFizzBuzzLine(out, line);
      }
    }

    // Returns the first line number of this chunk mod 15.
    static constexpr int64_t FirstLineNumberMod15() {
      if constexpr (BATCH_ID == 0) {
        return DIGITS > 1 ? 10 : 1;
      } else {
        return (PreviousBatch::FirstLineNumberMod15() +
          PreviousBatch::LinesInBatch()) % 15;
      }
    }

    // Returns the number of lines in this batch.
    static constexpr int64_t LinesInBatch() {
      return std::min(kMaxLinesInBatch, LinesInRun());
    }

    // Returns the size of this batch in bytes.
    static constexpr int64_t BytesInBatch() {
      if constexpr (LinesInBatch() < kMaxLinesInBatch) {
        return RunBytes();
      } else {
        size_t size = LinesInBatch() / 15 * FifteenerBytes();
        for (int64_t i = FirstLineNumberMod15() + LinesInBatch() / 15 * 15;
             i < FirstLineNumberMod15() + LinesInBatch(); ++i) {
          size += FizzBuzzLineLength(i);
        }
        return size;
      }
    }

    void Output() {
      output_handler_->Output(buffer_id_, BytesInBatch());
    }

    char* GetBuffer() {
      return output_handler_->GetBuffer(buffer_id_);
    }

    OutputHandler* output_handler_;
    // The buffer id that this batch should use in |output_handler_|.
    int buffer_id_;
  };

  // Represents a chunk, a part of batch processed by thread with id
  // |THREAD_ID|. THREAD_ID ∈ [0, kParallelism)
  // Since numbers in each chunk need to be incremented at different indexes,
  // we specialize this class for each BATCH_ID and THREAD_ID so the indexes can
  // be precomputed at compile time.
  template<int BATCH_ID, int THREAD_ID>
  class Chunk {
    using PreviousChunk = Chunk<BATCH_ID, THREAD_ID - 1>;

    public:
    // Initializes a chunk that resides in |batch|.
    Chunk(Batch<BATCH_ID> batch) : batch_(batch) {}

    // Returns the first line number of this chunk mod 15.
    static constexpr int64_t FirstLineNumberMod15() {
      if constexpr (THREAD_ID == 0) {
        return Batch<BATCH_ID>::FirstLineNumberMod15();
      } else {
        return (PreviousChunk::FirstLineNumberMod15() +
          PreviousChunk::LinesInChunk()) % 15;
      }
    }

    // Returns the index of the start byte of this chunk in the batch.
    static constexpr int64_t StartIndexInBatch() {
      if constexpr (THREAD_ID == 0) {
        return 0;
      } else {
        return PreviousChunk::StartIndexInBatch() +
          PreviousChunk::BytesInChunk();
      }
    }

    // Returns the number of lines in this chunk.
    static constexpr int64_t LinesInChunk() {
      int64_t done = THREAD_ID == 0 ? 0 :
        PreviousChunk::CumulativeLinesUpToChunk();
      int64_t remaining_lines = Batch<BATCH_ID>::LinesInBatch() - done;
      int64_t remaining_threads = kParallelism - THREAD_ID;
      // equivalent to ceil(remaining_lines / remaining_threads)
      return (remaining_lines - 1) / remaining_threads + 1;
    }

    // Returns the number of lines in this and all previous chunks in the batch.
    static constexpr int64_t CumulativeLinesUpToChunk() {
      if constexpr (THREAD_ID < 0) {
        return 0;
      } else {
        return PreviousChunk::CumulativeLinesUpToChunk() + LinesInChunk();
      }
    }

    // Returns the length of this chunk in bytes.
    static constexpr int64_t BytesInChunk() {
      size_t size = LinesInChunk() / 15 * FifteenerBytes();
      for (int64_t i = FirstLineNumberMod15() + LinesInChunk() / 15 * 15;
           i < FirstLineNumberMod15() + LinesInChunk(); ++i) {
        size += FizzBuzzLineLength(i);
      }
      return size;
    }

    // Increments all the numbers in the chunk.
    // This function wraps IncrementNumbersImpl for efficiently dispatching to
    // specialized versions based on |prefix|.
    void IncrementNumbers(int64_t prefix) {
      // If DIGITS < kSuffixDigits, it means that all the numbers within a run
      // will fit into a single batch, so we should not use IncrementNumbers().
      // The below implementation would not even work.
      static_assert(DIGITS >= kSuffixDigits);
      constexpr int64_t max_overflow_digits = DIGITS - kSuffixDigits;
      // Contains an IncrementChunkImpl() specialization for each value in
      // 0..max_overflow_digits. We use it to jump to the right specialization.
      constexpr auto increment_chunk_impls = []() {
        std::array<void (*)(char*), max_overflow_digits + 1> res{};
        [&]<size_t... OVERFLOW_DIGITS>(std::index_sequence<OVERFLOW_DIGITS...>) {
          ((res[OVERFLOW_DIGITS] = &IncrementNumbersImpl<OVERFLOW_DIGITS>), ...);
        }(std::make_index_sequence<max_overflow_digits + 1>());
        return res;
      }();

      increment_chunk_impls[OverflowDigits(prefix)](batch_.GetBuffer());
    }

    private:
    // Increments this chunk in |batch|.
    //
    // Each number line is incremented by |kIncrementBy| * 10^kSuffixDigits.
    // If OVERFLOW_DIGITS > 0, we assume that the operation will overflow,
    // therefore, we need to increment this many digits beforehand. It's the
    // caller's responsibility to calculate the number of digits that will need
    // to be updated in this chunk.

    // For example, if kIncrementBy = 3 and kSuffixDigits = 6, the
    // chunk [100000000, 100999999] can be incremented to [103000000; 103999999]
    // with OVERFLOW_DIGITS = 0 (no overflow).
    // When incrementing [108000000, 108999999] to [111000000; 111999999],
    // OVERFLOW_DIGITS = 1 (one-digit overflow).
    // When incrementing [198000000, 198999999] to [201000000, 201999999],
    // OVERFLOW_DIGITS = 2 (two-digit overflow)
    template<int OVERFLOW_DIGITS>
    static void IncrementNumbersImpl(char* batch) {
      char* out = batch;
      constexpr int64_t start_index = StartIndexInBatch();
      constexpr int first_line_number_mod_15 = FirstLineNumberMod15();

      // Increments |num_lines| lines starting at |out|.
      // |num_lines| must be divisible by 120 (except in the last iteration).
      auto increment = [&] (int num_lines) __attribute__((always_inline)) {
        int line_start = 0;
        #pragma GCC unroll 120
        for (int64_t line = 0; line < num_lines; ++line) {
          if (IsFizzBuzzNumber(first_line_number_mod_15 + line)) {
            // In order for the compiler to generate efficient code, the
            // second and third params should be deducible to constants.
            // Since the loop is unrolled, the value of |line_start| is
            // known in every iteration. |start_index| is constexpr, so
            // its value is also known.
            IncrementNumber(out, line_start + start_index, OVERFLOW_DIGITS);
          }
          line_start +=
              FizzBuzzLineLength((first_line_number_mod_15 + line) % 15);
        }
        // Since num_lines is a multiple of 120, the right-hand side is a
        // multiple of 8, which ensures that |out| is aligned to 8 bytes
        // afterwards.
        out += FifteenerBytes() * num_lines / 15;
      };

      for (int64_t i = 0; i < LinesInChunk() / 120; ++i) {
        increment(120);
      }
      increment(LinesInChunk() % 120);
    }

    // Returns whether this number is printed as-is, i.e. it's not a multiple
    // of 3 or 5.
    static constexpr bool IsFizzBuzzNumber(int64_t number) {
      return number % 3 != 0 && number % 5 != 0;
    }

    // Increments the number starting at base[line_start].
    // |base| must be aligned to 8 bytes. The caller must guarantee that the
    // number of overflows that occur is |overflow_digits|.
    // For maximum performance, |line_start| should be deducible at compile
    // time.
    __attribute__((always_inline))
    static inline void IncrementNumber(char* base,
                                       int64_t line_start,
                                       int overflow_digits) {
      int64_t right_most_digit_to_update_index =
          line_start + DIGITS - 1 - kSuffixDigits;
      // When overflow_digits is known at compile time, all the IncrementAt
      // calls that affect the same 8-byte integer are combined into 1
      // instruction by the compiler.
      IncrementAt(base, right_most_digit_to_update_index, kIncrementBy);
      #pragma GCC unroll 100
      for (int i = 0; i < overflow_digits; ++i) {
        IncrementAt(base, right_most_digit_to_update_index, -10);
        IncrementAt(base, right_most_digit_to_update_index - 1, 1);
        right_most_digit_to_update_index--;
      }
    }

    // Increments the byte at |index| in |base| by |by|.
    // |base| must be aligned to 8 bytes.
    // For maximum performance, |index| and |by| should be deducible by the
    // compiler to constants.
    __attribute__((always_inline))
    static inline void IncrementAt(char* base, int64_t index, char by) {
      union char_array_int64 {
        char ch[8];
        int64_t int64;
      };
      auto base_as_union = reinterpret_cast<char_array_int64*>(base);
      // The code below only works on little endian systems.
      static_assert(std::endian::native == std::endian::little);
      // Increment the character at index |index| by |by|. This works because
      // we can guarantee that the character won't overflow.
      base_as_union[index / 8].int64 +=
        static_cast<int64_t>(by) << ((index % 8) * 8);
    }

    // Returns the number of digits that will overflow when incrementing
    // |prefix| by |kIncrementBy|.
    // Eg. if kIncrementBy = 3:
    // OverflowDigits(100) = 0 (no digits overflow)
    // OverflowDigits(108) = 1 (8 overflows and 0 is incremented by 1)
    // OverflowDigits(198) = 2 (8 overflows and 9 overflows)
    static int OverflowDigits(int64_t prefix) {
      int64_t incremented = prefix + kIncrementBy;
      #pragma GCC unroll 2
      for (int i = 0; i < 20; ++i) {
        incremented /= 10;
        prefix /= 10;
        if (incremented == prefix) {
          return i;
        }
      }
      return 20;
    }

    Batch<BATCH_ID> batch_;
  };
};

} // namespace

int main() {
  OutputHandler output_handler;

  [&]<std::size_t... I>(std::index_sequence<I...>){
    (Run<I + 1>::Execute(output_handler), ...);
  }(std::make_index_sequence<18>());

  return 0;
}

The algorithm

I reuse some of the ideas from ais523's answer, namely:

  • using vmsplice for zero-copy output into the pipe
  • aligning the output buffers to 2MB and using huge pages

Definitions

  • line number: the id of each line starting with 1, 2, ...
  • mod: the line number mod 15
  • fizzbuzz line: one line of output
  • fizzbuzz function: a function that translates the line number to a fizzbuzz line according to the fizzbuzz logic
  • number line: a line of output which is a number (and not fizz, buzz or fizzbuzz)
  • fifteener: 15 lines of consecutive output
  • batch: 1,000,000 lines of consecutive output
  • run: consecutive output where the line numbers have the same number of digits in base 10, e.g. run(6) is the output for line numbers 100000 ... 999999

A few observations

Observation 1: within each fifteener, the number lines are always at the same indices, namely at indices 1, 2, 4, 7, 8, 11, 13 and 14

Observation 2: each run with 2+ digits contains a whole number of fifteeners

Observation 3: each run with 2+ digits starts with mod = 10 because 10^N ≡ 10 (mod 15) for N > 0

Observation 4: if we have 3 batches (3,000,000 lines) of output in a buffer, we can get the next 3 batches by incrementing the 6th digit (0-indexed) from the right of each number line by 3 in each batch. We can keep other digits untouched. We'll call the last 6 digits of the number suffix digits, since these will never change in a run. The fizz/buzz/fizzbuzz lines are also untouched.

For example the first batch of run(9) looks like this:

Buzz
100000001
Fizz
100000003
100000004
FizzBuzz
...
Fizz
100999999

Second batch of run(9):

Buzz
Fizz
101000002
101000003
...
101999998
101999999

Third batch of run(9):

FizzBuzz
102000001
102000002
Fizz
...
102999998
Fizz

We can get the fourth batch by incrementing the first batch by 3,000,000:

Buzz
103000001
Fizz
103000003
103000004
FizzBuzz
...
Fizz
103999999

Incrementing single digits is much faster than recomputing the numbers every time.

We only need to maintain three buffers for the three batches and keep incrementing numbers by 3,000,000.

It's important to note that the number lines in the buffer contain the string representation of the numbers, e.g. 103000003 is actually ['1','0','3','0','0','0','0','0','3'] = [49, 48, 51, 48, 48, 48, 48, 48, 51]. Incrementing by 3,000,000 means incrementing the 6th digit (0-indexed) from the right by 3.

Using three buffers also has an additional benefit: we can put up to two buffers into the pipe for the downstream process to read from (see vmsplice and this article) and update the third buffer in the meantime.

The basic algorithm is as follows:

for run in 1..19:
  initialize batch0 with fizz buzz lines between 10^(run-1) and 10^(run-1) + 999,999
  output batch0
  initialize batch1 with fizz buzz lines between 10^(run-1) + 1,000,000 and 10^(run-1) + 1,999,999
  output batch1
  initialize batch2 with fizz buzz lines between 10^(run-1) + 2,000,000 and 10^(run-1) + 2,999,999
  output batch2
  for batch in 3..(number of batches in run):
    increment batch0
    output batch0
    increment batch1
    output batch1
    increment batch2
    output batch2

The algorithm is fast because the increment operation (which is where most of the time is spent) can be optimized really well.

Overflows and carry

A major complication in the above algorithm is when a digit overflows. For example, if we increment the digit '8' in 108399977 by 3, the result is not a digit, so we have to take care of the overflow. We do this by first incrementing '8' by 3, then subtracting 10 and adding 1 to the '0' before the '8' (which is pretty much how we'd do it on paper). Furthermore, more than one digit can overflow, e.g. if the number is 198399977. In this case, we:

  • add 3 to '8' (the digit is now past '9')
  • subtract 10 from it
  • add 1 to the '9' to its left (which also goes past '9')
  • subtract 10 from it
  • add 1 to the '1'

The final result is 201399977.

However, checking in each iteration whether an overflow has occurred is pretty slow. This is where batches are useful once again. Since a batch is 1,000,000 lines of output, all numbers in a batch share a common prefix.

122|531269
    ------  suffix (last 6 digits)
---         prefix (all previous digits)

As mentioned above, the suffixes are never touched after the initialization. We only increment the prefix.

The nice property of a batch is that all numbers in a batch overflow the same way, so we only have to check once per chunk how many digits will need to be updated for each number. We call this the overflow count.

We get extra performance gains by incrementing each batch from multiple threads. One section of a batch updated by a thread is called a chunk.

C++ tricks

After discussing the algorithm, here are a few ideas that make this algorithm particularly fast:

8 is better than 1

Previously we talked about incrementing single characters in the buffer, but CPUs can work with 8-byte integers faster than with 1-byte integers. Furthermore, if we have to update multiple digits because of an overflow, updating 8 bytes at once reduces the number of instructions.

For this to work, a requirement is that the integers must be aligned at 8 bytes, so we need to know where the 8-byte boundaries are.

Consider the number 12019839977 where we want to add 6 to the digit '8' (and handle overflow). Let's assume that the (one-byte) indexes mod 8 are as follows:

output:        X Y 1 2 0 1 9 8 3 9 9 7 7
index mod 8:   0 1 2 3 4 5 6 7 0 1 2 3 4

X Y are the last two bytes before this number. Let's call the address of X base. This address is aligned to 8 bytes. Instead of updating the single bytes at (base + 7), (base + 6) and (base + 5), we can update all 8 bytes in a single operation using bit shifts.

On little endian systems (like x86) where the least significant byte is at the lowest address, this translates to:

base[index \ 8] += 1 << (5 * 8)  |  (1 - 10) << (6 * 8)  |  (6 - 10) << (7 * 8)
                         ^             ^
                  index mod 8 = 5    increment by 1 - 10 (add carry and handle overflow)

All the updates we want to apply to the number are combined into a single constant. What's even better is that even if we write them as individual instructions, the compiler is smart enough to merge them into a single addition as long as the right-hand sides are compile-time constants:

base[index \ 8] += 1 << (5 * 8);
base[index \ 8] += (1 - 10) << (6 * 8);
base[index \ 8] += (6 - 10) << (7 * 8);

Doing all these bit manipulations at runtime would be slower than just incrementing the numbers one byte at a time, so we'll be ...

Using the compiler for maximum gains

All the calculations needed for the previous step to work fast are done at compile time. A few more observations:

  • The first batch starts with mod 10, the second batch starts with mod 5, and the third batch starts with mod 0.
  • The first batch is aligned at 8 bytes. We can calculate the length of each batch and chunk at compile time.

Using C++ templates, we generate specialized code for each (run digits, batch id, chunk id, overflows) tuple.

  • run digits: the number of digits of each number line in this run
  • batch id: 0, 1 or 2 (see the Observation 4 above)
  • chunk id: to distinguish the chunk in the batch, [0, kParallelism)
  • overflow count: the number of digits that will overflow after incrementing the last digit of the prefix

In order to support the compiler in generating branchless code, we aggressively unroll loops so conditions and calculations can be done at compile time. The price is a long compile time.

If we inspect the generated assembly, we can see that the compiler generates specialized code which only contains add/sub instructions without any branches.

add QWORD PTR 8[rax], rdx
sub QWORD PTR 40[rax], 1033
add QWORD PTR 32[rax], rdx
add QWORD PTR 56[rax], r8
sub QWORD PTR 88[rax], 4
add QWORD PTR 80[rax], rsi
sub QWORD PTR 104[rax], 67698432
sub QWORD PTR 128[rax], 67698432
sub QWORD PTR 160[rax], 4
[many more add/sub]

Most of the time, we only need 8 instructions for each fifteener.

Feedback / ideas for improvement welcome!

\$\endgroup\$
5
  • \$\begingroup\$ I'm not getting any substantial multi-thread speedups testing on a Skylake CPU (~16GB/s with 1 thread, ~20GB/s with 2+). Additionally the pv "-B 2M" option just lowers these to ~10GB/s and ~7.5GB/s. How important is disabling CPU mitigations? Are there other implied OS/CPU requirements like L2/L3 cache size? THP is set to "madvise". \$\endgroup\$
    – Austin
    Commented Feb 14 at 0:17
  • 1
    \$\begingroup\$ @Austin Can you check your pv version, it looks like later versions (like 1.8) work worse when specifying -B, I recommend testing with 1.6.6. Depending on your L3 cache size, you may achieve better results by lowering kSuffixDigits from 6 to 5. If that gives better results, you could then try increasing kIncrementBy from 3 to 6 or 9 (only 3, 6 or 9 is allowed). If you want the code to compile faster for tuning the constants, you can edit the main function. Change <18> to <1>, and I + 1 to I + 14. Then the program will only output 14-digit numbers but it compiles much faster. \$\endgroup\$ Commented Feb 16 at 11:10
  • \$\begingroup\$ Ok, with tweaking those constants so it stays in L3 (kSuffixDigits=4 so it fits in L2 is sometimes even better since Intel's L3 is so slow) speeds it up to ~45GB/s (still with no scaling past 2 cores though). Downgrading pv so "-B" doesn't disable splice() gets ~55GB/s with 2 cores and 85GB/s with 5 for me. Does it scale better on your 7700X or is its single thread performance just that much better? \$\endgroup\$
    – Austin
    Commented Feb 21 at 4:03
  • 1
    \$\begingroup\$ This is mostly speculation, but here are a few things that could cause the difference: 1) on AMD Zen4 architecture, 64-bit addition is done in 41% less core cycles compared to Skylake (I used the data from uops.info), which makes up the vast majority of the program. 2) The 7700X has 4.5 Ghz clock speed with 5.4 GHz boost - what's yours? 3) Every increment in the kSuffixDigits constant will achieve 10x more work before synchronizing CPU cores but requires 10x as much L3 cache as well. Synchronizing between cores can be relatively expensive, so doing it less frequently makes things faster. \$\endgroup\$ Commented Feb 22 at 9:06
  • 1
    \$\begingroup\$ 4) Turning off CPU mitigations gave a 30% performance boost on my machine (Ubuntu 22.04). \$\endgroup\$ Commented Feb 22 at 9:09
23
\$\begingroup\$

I tweaked Neil's code a bit (so most credit goes to him) and managed to squeeze some more performance out of it; I also prepared it for unrolling more loops, but ultimately I gave up (that's why the code is unreadable gobbledygook).

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define f(Z) {char*p=q+=Z+l;if(*p<'7')*p+=3;else{*p---=7;while(*p=='9')*p--='0';++*p;}}
#define v(N) {while(j<i){memcpy(t,buf,N);t+=n;if(t>=&out[65536]){char*u=out; \
           do{int w=write(1,u,&out[65536]-u);if(w>0)u+=w;}while(u<&out[65536 \
           ]);memcpy(out,out+65536,t-&out[65536]);t-=65536;}char*q=buf;f(4); \
           f(7);f(2);f(11);f(2);f(7);f(12);f(2);f(12);f(7);f(2);f(11);f(2);f \
           (7);f(12);f(2);j+=3;}}
char buf[256];
char out[65536 + 4096] = "1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\n";
int main(void) {
  char *t = out + 30;
  unsigned long long i = 1, j = 1;
  for (int l = 1; l < 20; l++) {
    int n=sprintf(buf, "Buzz\n%llu1\nFizz\n%llu3\n%llu4\nFizzBuzz\n%llu6\n%llu7\nFizz\n%llu9\nBuzz\nFizz\n%llu2\n%llu3\nFizz\nBuzz\n%llu6\nFizz\n%llu8\n%llu9\nFizzBuzz\n%llu1\n%llu2\nFizz\n%llu4\nBuzz\nFizz\n%llu7\n%llu8\nFizz\n", i, i, i, i, i, i, i + 1, i + 1, i + 1, i + 1, i + 1, i + 2, i + 2, i + 2, i + 2, i + 2);
    i*=10;
    v(n);
  }
  return 0;
}

On my PC, Neil's submission is ~5% slower. I also tried it on a friend's Intel box and the tweaked version is faster there as well.

\$\endgroup\$
5
  • 2
    \$\begingroup\$ Added to the scoreboard. This is the best one so far. You seem to also have made Neil's submission much more "stable" (less variance in the output rate). \$\endgroup\$ Commented Nov 15, 2020 at 22:59
  • \$\begingroup\$ @IsaacG pointed out that my buf is actually too small and my program would crash if it ever got to the really high numbers. \$\endgroup\$
    – Neil
    Commented Nov 16, 2020 at 22:32
  • \$\begingroup\$ @Neil Yeah, I noticed this too, but if I tried to enlarge it the performance takes a solid drop. So we're just hoping it works lol \$\endgroup\$ Commented Nov 17, 2020 at 6:25
  • \$\begingroup\$ Huh, so it does, but I tried 416 and it seems to be OK, so... \$\endgroup\$
    – Neil
    Commented Nov 17, 2020 at 10:09
  • \$\begingroup\$ i feel like the buffer is too big to fit in l1 cache, not that it affects anything... \$\endgroup\$
    – ASCII-only
    Commented Feb 9, 2021 at 6:41
22
\$\begingroup\$

Here is my attempt at using just-in-time compilation to emit fast FizzBuzz assembly that is specialized for every digit length. It's basically the same idea as Neil's answer, just more overengineered. A further 2x comes from the vmsplice system call as in the winning answer. While there are a few other similarities in the AVX2 code, the usage of vmsplice is the only bit that I downright "stole" from there; all the vector code is my own.

The basic idea is to extract 32 bytes out of a prebuilt set of 32 characters that includes the current line number divided by ten (lo_bytes), the digits 0-9 (hi_bytes 0-9), the letters in Fizz and Buzz (hi_bytes 11-15) and the newline character (hi_bytes byte 10). There are some extra complications:

  • about half of the time the 32 bytes must be extracted in two steps, with the increment of lo_bytes inserted between the two extractions. The alternation of "mov", "or", "store", "increment lo_bytes" and "end of run" operations is stored as a kind of "bytecode". First, the program generates a string with the template of the FizzBuzz output for 30 consecutive numbers (30 is the LCM of 3, 5 and 10); then, to generate the bytecode, it operates on three substrings corresponding to 10 consecutive numbers.

  • the AVX2 vpshufb instruction operates on two "lanes" of 128 bits. Therefore it can only gather from bytes 0-15 into bytes 0-15, and from bytes 16-31 into bytes 16-31. It can also place a zero in any byte though, which comes in really handy. This is why there are separate "lo_bytes" and "hi_bytes". Each mask is split in two parts, one for the "lo_bytes" and one for the "hi_bytes"; each is vpshufb'ed while filling the bytes that come from the "wrong" character with a zero, and then the two parts are ORed together.

I mentioned the bytecode before. There are two reasons why the program does not go directly to x86 code. First, going through bytecode simplifies noticeably the JIT compiler. Second, the JIT-compiled inner loop does not handle carry from byte 7 to byte 8 of the lo_bytes (which happens every 10^9 numbers written); that part is handled by interpreting the bytecode.

Actually, there's a third reason to have the bytecode, and it is possibly the most important even though it doesn't apply to the code submitted below. The bytecode approach cleanly separates the tasks of preprocessing (figuring out the exact sequence for each number length) and emitting output; therefore, during development I could first work on the preprocessor while keeping a stupid for loop for the output, then add vectorized C code (which actually survives in the slow path to handle carries), and finally generate the code on the fly. Of course the program was super slow until the introduction of the JIT, but being able to test AVX2 code in C is obviously much easier! The whole implementation only took about 6 hours, hence more optimization is probably possible (sizing the pipe buffer, better scheduling of the x86 code, etc.).

The hints about taskset and the same warnings about needing a "useless cat" apply to this program as well due to the use of vmsplice.

The code is not super polished. Variable names are especially horrible, sorry about that.

/*
 * Author: Paolo Bonzini
 * gcc fb.c -o fb -O2 -g -mavx -mavx2 -flax-vector-conversions
 */
#define _GNU_SOURCE

#include <sys/mman.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <immintrin.h>
#include <avxintrin.h>

#define F_SETPIPE_SZ 1031

#define ZERO    15

typedef uint8_t v32qi __attribute__((vector_size(32), aligned(32)));
typedef uint8_t v32qi_u __attribute__((vector_size(32), aligned(1)));

v32qi lo_bytes = {
    '1', '0', '0', '0', '0', '0', '0', '0',     /* 0 */
    '0', '0', '0', '0', '0', '0', '0', '\0',    /* 8 */
    '1', '0', '0', '0', '0', '0', '0', '0',     /* 0 */
    '0', '0', '0', '0', '0', '0', '0', '\0',    /* 8 */
};

uint8_t hi_bytes[16] = {
    '0', '1', '2', '3', '4', '5', '6', '7',     /* 16 */
    '8', '9', '\n', 'z', 'u', 'B', 'i', 'F',    /* 24 */
};

static v32qi biased_zero = {
    246, 246, 246, 246, 246, 246, 246, 246,
    246, 246, 246, 246, 246, 246, 246, 246,
    246, 246, 246, 246, 246, 246, 246, 246,
    246, 246, 246, 246, 246, 246, 246, 246,
};

static v32qi biased_line = {
    247, 246, 246, 246, 246, 246, 246, 246,
    246, 246, 246, 246, 246, 246, 246, 246,
    247, 246, 246, 246, 246, 246, 246, 246,
    246, 246, 246, 246, 246, 246, 246, 246,
};

static v32qi incr_low_mask = {
    1, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
    1, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
};

static v32qi incr_high_mask = {
    0, 0, 0, 0, 0, 0, 0, 0,
    1, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
    1, 0, 0, 0, 0, 0, 0, 0,
};

#define OUTPUT_SIZE(n)      (94+16*n)
#define TEMPLATE_SIZE(n)    ((OUTPUT_SIZE(n) + 31) & ~31)
#define MAX         15

#define OUTBUF_SIZE     1048576

static uint8_t template[TEMPLATE_SIZE(MAX)];
static uint8_t *output1;
static uint8_t *output2;
static uint8_t *output;
#define BOUNDARY (output + OUTBUF_SIZE)

static v32qi mask[26];
static v32qi *mask_ptr;

static uint8_t code_buffer[64];
static uint8_t *code_ptr;

static uint8_t *jit_buffer;
static uint8_t *jit_ptr;
typedef uint8_t *jit_fn(uint8_t *, int);

/*
 * Bytecode language:
 *   0 = mov 32 bytes into temp buffer from the next mask
 *   1 = or 32 bytes into temp buffer from the next mask
 *   2 = increment the line
 *   3 = store 32 bytes from temp buffer
 *   -32..-1 = add n to the output pointer
 */
static void gen_prolog(void)
{
    code_ptr = code_buffer;
    mask_ptr = mask;
}

static void gen_epilog(int template_size)
{
    *code_ptr++ = 3;
    *code_ptr++ = (template_size & 31) - 32;
}

static void do_gen_or_code(int from)
{
    assert(mask_ptr - mask < sizeof(mask) / sizeof(mask[0]));

    // o[i++] |= out_bytes[template[from + i]];
    for (int i = 0; i < 32; i++) {
        uint8_t m = template[from + i];
        if (m < 16) {
            mask_ptr[0][i] = m;
            mask_ptr[1][i] = 0;
        } else {
            mask_ptr[0][i] = -1;
            mask_ptr[1][i] = hi_bytes[m - 16];
        }
    }
    *code_ptr++ = 1;
    mask_ptr += 2;
}

static void do_gen_mov_code(int from, int to)
{
    assert(mask_ptr - mask < sizeof(mask) / sizeof(mask[0]));

    // o[i++] = out_bytes[template[from + i]];
    for (int i = 0; i < 32; i++) {
        uint8_t m = (from + i > to) ? ZERO : template[from + i];
        if (m < 16) {
            mask_ptr[0][i] = m;
            mask_ptr[1][i] = 0;
        } else {
            mask_ptr[0][i] = -1;
            mask_ptr[1][i] = hi_bytes[m - 16];
        }
    }
    *code_ptr++ = 0;
    mask_ptr += 2;
}

static void gen_inc_code(void)
{
    *code_ptr++ = 2;
}

static void gen_out_code(int from, int to)
{
    int offset = from & ~31;
    if (offset < from) {
        assert(to >= offset + 32);
        do_gen_or_code(offset);
        offset += 32;
        *code_ptr++ = 3;
    }
    while (offset < to) {
        do_gen_mov_code(offset, to);
        offset += 32;
        if (offset <= to)
            *code_ptr++ = 3;
    }
    memset(template + from, ZERO, to - from);
}

static void inc_line(v32qi incr_mask)
{
    v32qi old = biased_line;
    v32qi incr = _mm256_add_epi64((__m256i)incr_mask, (__m256i)old);
    biased_line = _mm256_max_epu8(incr, biased_zero);
    lo_bytes += biased_line - old;
}

static v32qi do_shuffle(v32qi *mask)
{
    v32qi digits = __builtin_ia32_pshufb256(lo_bytes, mask[0]);
    return digits | mask[1];
}

static uint8_t *maybe_output(uint8_t *o)
{
    if (o > output + OUTBUF_SIZE) {
#if 1
        struct iovec iov = {output, OUTBUF_SIZE};
        do {
            ssize_t r = vmsplice(1, &iov, 1, 0);
            if (r < 0) {
                perror("vmsplice");
                exit(1);
            }
            iov.iov_base += r;
            iov.iov_len -= r;
        } while (iov.iov_len);
#else
        write(1, output, OUTBUF_SIZE);
#endif
        if (output == output1) {
            memcpy(output2, BOUNDARY, o - BOUNDARY);
            o = output2 + (o - BOUNDARY);
            output = output2;
        } else {
            memcpy(output1, BOUNDARY, o - BOUNDARY);
            o = output1 + (o - BOUNDARY);
            output = output1;
        }
    }
    return o;
}

static uint8_t *slow_run(uint8_t *o, int carry)
{
    const uint8_t *p;
    v32qi *m = mask;
    v32qi temp;
    for (p = code_buffer; p < code_ptr; p++) {
        uint8_t c = *p;
        if (c == 0) {
            temp = do_shuffle(m);
            m += 2;
        } else if (c == 1) {
            temp |= do_shuffle(m);
            m += 2;
        } else if (c == 3) {
            *(v32qi_u *)o = temp;
            o += 32;
        } else if (c == 2) {
            inc_line(incr_low_mask);
            if (--carry == 0)
                inc_line(incr_high_mask);
        } else {
            o += (int8_t) c;
        }
    }
    return maybe_output(o);
}

#define o(b) (*jit_ptr++ = (0x##b))
#define s(p) jit_ptr += ({ uint32_t x = (uintptr_t)p - (uintptr_t)(jit_ptr + 4); memcpy(jit_ptr, &x, 4); 4; })
#define d(i) jit_ptr += ({ uint32_t x = (i); memcpy(jit_ptr, &x, 4); 4; })

void compile(void)
{
    const uint8_t *p, *label;
    v32qi *m = mask;
    int ofs = 0;

    jit_ptr = jit_buffer;
    o(C5),o(FD),o(6F),o(05),s(&lo_bytes);      // vmovdqa ymm0, lo_bytes
    o(C5),o(FD),o(6F),o(15),s(&biased_line);   // vmovdqa ymm2, biased_line
    o(C5),o(FD),o(6F),o(1D),s(&biased_zero);   // vmovdqa ymm3, biased_zero
    o(C5),o(FD),o(6F),o(25),s(&incr_low_mask); // vmovdqa ymm4, incr_low_mask

    /* in inc_line, lo_bytes - old is always the same.  Put it in ymm1.  */
    o(C5),o(FD),o(F8),o(CA);           // vpsubb ymm1, ymm0, ymm2

    label = jit_ptr;
    for (p = code_buffer; p < code_ptr; p++) {
        uint8_t c = *p;
        if (c == 0) {
            o(C5),o(FD),o(6F),o(35),s(m);  // vmovdqa ymm6, MASK
            m++;
            o(C5),o(FD),o(6F),o(2D),s(m);  // vmovdqa ymm5, MASK
            m++;
            o(C4),o(E2),o(7D),o(00),o(F6); // vpshufb ymm6, ymm0, ymm6
            o(C5),o(D5),o(EB),o(EE);       // vpor ymm5, ymm5, ymm6
        } else if (c == 1) {
            o(C5),o(FD),o(6F),o(35),s(m);  // vmovdqa ymm6, MASK
            m++;
            o(C5),o(FD),o(6F),o(3D),s(m);  // vmovdqa ymm7, MASK
            m++;
            o(C4),o(E2),o(7D),o(00),o(F6); // vpshufb ymm6, ymm0, ymm6
            o(C5),o(D5),o(EB),o(EF);       // vpor ymm5, ymm5, ymm7
            o(C5),o(D5),o(EB),o(EE);       // vpor ymm5, ymm5, ymm6
        } else if (c == 3) {
            o(C5),o(FE),o(7F),o(AF),d(ofs); // vmovdqu [rdi+NNN], ymm5
            ofs += 32;
        } else if (c == 2) {
            o(C5),o(ED),o(D4),o(D4);      // vpaddq ymm2, ymm2, ymm4
            o(C5),o(ED),o(DE),o(D3);      // vpmaxub ymm2, ymm2, ymm3
            o(C5),o(F5),o(FC),o(C2);      // vpaddb ymm0, ymm1, ymm2
        } else {
            ofs += (int8_t) c;
        }
    }
    o(48),o(81),o(C7),d(ofs);   // add rdi, ofs
    o(FF),o(CE);                // dec esi
    o(0F),o(85),s(label);       // jnz label
    o(48),o(89),o(F8);          // mov rax, rdi
    o(C5),o(FD),o(7F),o(05),s(&lo_bytes);     // vmovdqa lo_bytes, ymm0
    o(C5),o(FD),o(7F),o(15),s(&biased_line);  // vmovdqa biased_line, ymm2
    o(C3);                // ret
}

#define INITIAL "1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\n"
#define TENS_FOR_VPADDQ (10000 * 10000)

int main()
{
    uint8_t shuffle[] = { 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 16, 26 };
    const uint8_t fizz[] = { 31, 30, 27, 27, 26 };
    const uint8_t fizzbuzz[] = { 31, 30, 27, 27, 29, 28, 27, 27, 26 };
    const uint8_t *buzz = fizzbuzz + 4;
    
    int l;
    uint64_t n;
    uint32_t tens_till_carry = TENS_FOR_VPADDQ - 1;

    output1 = mmap(NULL, OUTBUF_SIZE + 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
    output2 = mmap(NULL, OUTBUF_SIZE + 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
    output = output1;
    uint8_t *o = mempcpy(output, INITIAL, strlen(INITIAL));

    fcntl(1, F_SETPIPE_SZ, OUTBUF_SIZE);
    memset(template, ZERO, sizeof(template));

    jit_buffer = mmap(NULL, 16384, PROT_READ|PROT_WRITE|PROT_EXEC,
              MAP_32BIT|MAP_PRIVATE|MAP_ANON, -1, 0);
    assert((uintptr_t)mask <= 0x7FFFFFFF);
    assert((uintptr_t)jit_buffer <= 0x7FFFFFFF);

    for (l = 2, n = 3; l <= MAX; l++, n = n * 10) {
        int output_size = OUTPUT_SIZE(l);
        int template_size = TEMPLATE_SIZE(l);

        uint8_t *s = shuffle + sizeof(shuffle) - l - 1;
        uint8_t *p = template;

#define ZERO_UNITS s[l - 1] = 16;
#define INC_UNITS s[l - 1]++;

        ZERO_UNITS; p = mempcpy(p, buzz, 5); // 10
        INC_UNITS; p = mempcpy(p, s, l + 1); // 11
        INC_UNITS; p = mempcpy(p, fizz, 5); // 12
        INC_UNITS; p = mempcpy(p, s, l + 1); // 13
        INC_UNITS; p = mempcpy(p, s, l + 1); // 14
        INC_UNITS; p = mempcpy(p, fizzbuzz, 9); // 15
        INC_UNITS; p = mempcpy(p, s, l + 1); // 16
        INC_UNITS; p = mempcpy(p, s, l + 1); // 17
        INC_UNITS; p = mempcpy(p, fizz, 5); // 18
        INC_UNITS; p = mempcpy(p, s, l + 1); // 19
        ZERO_UNITS; p = mempcpy(p, buzz, 5); // 20
        INC_UNITS; p = mempcpy(p, fizz, 5); // 21
        INC_UNITS; p = mempcpy(p, s, l + 1); // 22
        INC_UNITS; p = mempcpy(p, s, l + 1); // 23
        INC_UNITS; p = mempcpy(p, fizz, 5); // 24
        INC_UNITS; p = mempcpy(p, buzz, 5); // 25
        INC_UNITS; p = mempcpy(p, s, l + 1); // 26
        INC_UNITS; p = mempcpy(p, fizz, 5); // 27
        INC_UNITS; p = mempcpy(p, s, l + 1); // 28
        INC_UNITS; p = mempcpy(p, s, l + 1); // 29
        ZERO_UNITS; p = mempcpy(p, fizzbuzz, 9); // 30
        INC_UNITS; p = mempcpy(p, s, l + 1); // 31
        INC_UNITS; p = mempcpy(p, s, l + 1); // 32
        INC_UNITS; p = mempcpy(p, fizz, 5); // 33
        INC_UNITS; p = mempcpy(p, s, l + 1); // 34
        INC_UNITS; p = mempcpy(p, buzz, 5); // 35
        INC_UNITS; p = mempcpy(p, fizz, 5); // 36
        INC_UNITS; p = mempcpy(p, s, l + 1); // 37
        INC_UNITS; p = mempcpy(p, s, l + 1); // 38
        INC_UNITS; p = mempcpy(p, fizz, 5); // 39
        memset(p, ZERO, template + template_size - p);

        gen_prolog();
        gen_out_code(0, 30+6*l);
        gen_inc_code();
        gen_out_code(30+6*l, 60+11*l);
        gen_inc_code();
        gen_out_code(60+11*l, 94+16*l);
        gen_inc_code();
        gen_epilog(94+16*l);

        compile();

        uint64_t left = n;
        do {
            int runs;
            if (tens_till_carry <= 3) {
                if (tens_till_carry == 0) {
                    inc_line(incr_high_mask);
                    runs = 0;
                } else {
                    o = slow_run(o, tens_till_carry);
                    runs = 1;
                }
                tens_till_carry += TENS_FOR_VPADDQ;
            } else {
                runs = (BOUNDARY - o) / output_size + 1;
                if (runs > left)
                    runs = left;
                if (runs * 3 > tens_till_carry)
                    runs = tens_till_carry / 3;

                o = ((jit_fn *)jit_buffer) (o, runs);
                o = maybe_output(o);
            }
            left -= runs;
            tens_till_carry -= runs * 3;
        } while (left);
    }
    write(1, output, o - output);
}

Last minute add: here is how the number of source code lines grew as the bells and whistles were added. Handling the pesky carry is more work than the JIT compiler!

227 basic implementation
252 rewrite mask operations as AVX2
253 add store bytecode
258 move loop inside the run function
276 increment line number using AVX
332 handle carry over 10^8
377 generate custom AVX2 code for each length
404 use the "vmsplice" system call
\$\endgroup\$
10
  • 2
    \$\begingroup\$ Welcome to Code Golf, and fantastic first answer! I don't know much about this, but I'm curious to see how fast it ends up being. \$\endgroup\$ Commented Oct 29, 2021 at 16:51
  • 1
    \$\begingroup\$ Stable 23.3GiB/s! had to compile with gcc -mavx2 -flax-vector-conversions paolo.c, is that alright? \$\endgroup\$ Commented Oct 29, 2021 at 17:39
  • 1
    \$\begingroup\$ Yet again, I miss(understood) the taskset note. 40GiB/s with taskset \$\endgroup\$ Commented Oct 29, 2021 at 17:52
  • 3
    \$\begingroup\$ I suspect their calculation of the L2 size is not optimal on my computer, and my choice of the constant 1048576 might not be the best on yours. On my laptop my program is actually faster, so I had hopes (!), but the winning answer is awesome and they deserve first place. I'm happy I could get so close with my relatively simple algorithm, that I could implement in my spare time over less than a week. \$\endgroup\$ Commented Oct 29, 2021 at 22:18
  • 1
    \$\begingroup\$ L2 cache size on most non-server CPUs is 256KiB per-core private for Intel. AMD since Zen 1 has 512K per core. (Skylake-AVX512 has 1MiB L2.) Since it's a unified cache, it tends to have some code. And also page table data pulled in by the page-walker, and some extra lines from HW prefetch. So generally you don't want to size your buffer to fill L2; often 128KiB is a good choice in practice, half of the smallest common L2 size. Or 3/4 if you're optimistic and tuning for the best-case with little competition from interrupt handlers and other hyperthread. \$\endgroup\$ Commented Aug 10, 2022 at 19:09
14
\$\begingroup\$

Coded in Rust; modern languages can be fast too. Build with cargo build --release* and run with ./target/release/fizz_buzz. The count goes up by 15 on every iteration of the loop. The itoap crate is used to quickly write integers to the buffer. The program appends 15-line chunks to the buffer unless there isn't enough space left for a max-sized chunk; when that happens, it flushes the buffer to stdout.

main.rs:

use std::io::*;
use itoap::Integer;
const FIZZ:*const u8 = "Fizz\n".as_ptr();
const BUZZ:*const u8 = "Buzz\n".as_ptr();
const FIZZBUZZ:*const u8 = "FizzBuzz\n".as_ptr();
const BUF_SIZE:usize = 1024*256;
const BLOCK_SIZE:usize = 15 * i32::MAX_LEN;
/// buf.len() > count
macro_rules! itoap_write{
  ($buf:ident,$count:ident,$num:ident)=>{
    $count += itoap::write_to_ptr(
      $buf.get_unchecked_mut($count..).as_mut_ptr(),
      $num
    );
    $buf.as_mut_ptr().add($count).write(b'\n');
    $count += 1;
  }
}
///ptr must be valid, buf.len() > count, ptr.add(len) must not overflow buffer
macro_rules! str_write{
  ($buf:ident,$count:ident,$ptr:ident,$len:literal)=>{
    let ptr = $buf.get_unchecked_mut($count..).as_mut_ptr();
    ptr.copy_from_nonoverlapping($ptr,$len);
    $count += $len;
  }
}

fn main() -> Result<()>{
  let mut write = stdout();
  let mut count:usize = 0;
  let mut buf = [0u8;BUF_SIZE];
  let mut i:i32 = -1;
  loop{
    if count + BLOCK_SIZE > BUF_SIZE {
      unsafe{
        write.write_all(
          buf.get_unchecked(..count)
        )?;
      }
      count = 0;
    } 
    i += 2;
    unsafe{
      itoap_write!(buf,count,i);     
      i += 1;
      itoap_write!(buf,count,i);
      str_write!(buf,count,FIZZ,5);
      i += 2;
      itoap_write!(buf,count,i);
      str_write!(buf,count,BUZZ,5);
      str_write!(buf,count,FIZZ,5);
      i += 3;
      itoap_write!(buf,count,i);
      i += 1;
      itoap_write!(buf,count,i);
      str_write!(buf,count,FIZZ,5);
      str_write!(buf,count,BUZZ,5);
      i += 3;
      itoap_write!(buf,count,i);
      str_write!(buf,count,FIZZ,5);
      i += 2;
      itoap_write!(buf,count,i);
      i += 1;
      itoap_write!(buf,count,i);
      str_write!(buf,count,FIZZBUZZ,9);
    }
  }
}

Cargo.toml:

[package]
name = "fizz_buzz"
version = "0.1.0"
authors = ["aiden4"]
edition = "2018"

[dependencies]
itoap = "0.1"
[[bin]]
name = "fizz_buzz"
path = "main.rs"

[profile.release]
lto = "fat"

*requires cargo to be able to connect to the internet

\$\endgroup\$
9
\$\begingroup\$

Python3

Interesting problem. Most answers so far use statically typed languages, except for the (so far untimed) PowerShell answer. As I was curious how dynamic languages would fare at this task, I wrote a simple implementation in Python.

I ran it under GNU/Linux Mint 20.02 64-bit, using the default Python3 (3.8.10) and pypy (7.3.1) from the repositories. Processor model on my machine: AMD Athlon(tm) X4 750 Quad Core.

Initial version

The initial version just keeps a carousel running in sync to a counter, where the carousel carries either False or the results different from the current count, in the proper positions. The derailleur function selects what is to be printed.

from itertools import cycle, count

def derailleur(counter, carousel):
    if not carousel:
        return counter
    return carousel

def main():
    carousel = cycle([0,0,'Fizz',0,'Buzz','Fizz',0,0,'Fizz','Buzz',0,'Fizz',0,0,'FizzBuzz'])
    counter = count(1)
    f = map(print, map(derailleur, counter, carousel))
    while 1:
        next(f)

main()

On my machine it ran at about 14.2 MiB/s under CPython, and got a modest boost to 25.0 MiB/s under pypy:

user@Desktop:~$ python3 fizzbuzz.py | pv > /dev/null
^C21MiB 0:00:30 [14,2MiB/s] [                            <=>                   ]
Traceback (most recent call last):
  File "fizzbuzz.py", line 15, in <module>
    main()
  File "fizzbuzz.py", line 13, in main
    next(f)
  File "fizzbuzz.py", line 3, in derailleur
    def derailleur(counter, carousel):
KeyboardInterrupt
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe

user@Desktop:~$ pypy3 fizzbuzz.py | pv > /dev/null
^C57MiB 0:00:30 [25,0MiB/s] [                            <=>                   ]
Traceback (most recent call last):
  File "fizzbuzz.py", line 15, in <module>
    main()
  File "fizzbuzz.py", line 13, in main
    next(f)
KeyboardInterrupt

Chunking version

Afterwards I modified the initial version to print the results in chunks, instead of one by one.

from itertools import cycle, count

def derailleur(counter, carousel):
    if not carousel:
        return counter
    return carousel

def main():
    carousel = cycle([0,0,'Fizz',0,'Buzz','Fizz',0,0,'Fizz','Buzz',0,'Fizz',0,0,'FizzBuzz'])
    counter = map(str, count(1))
    f = map(derailleur, counter, carousel)
    while 1:
        print('\n'.join([next(f) for _ in range(256)]))

main()

Under CPython it now runs a bit faster, at about 19.6 MiB/s. But under pypy it receives a large boost, reaching 84.9 MiB/s, about half the speed of the naive C implementation reported by the OP (170 MiB/s):

user@Desktop:~$ python3 chunking_fizzbuzz.py | pv > /dev/null
^C84MiB 0:00:30 [19,6MiB/s] [                            <=>                   ]
Traceback (most recent call last):
  File "chunking_fizzbuzz.py", line 15, in <module>
    main()
  File "chunking_fizzbuzz.py", line 13, in main
    print('\n'.join([next(f) for _ in range(256)]))
KeyboardInterrupt

user@Desktop:~$ pypy3 chunking_fizzbuzz.py | pv > /dev/null
^C49GiB 0:00:30 [84,9MiB/s] [                            <=>                   ]
Traceback (most recent call last):
  File "chunking_fizzbuzz.py", line 15, in <module>
    main()
  File "chunking_fizzbuzz.py", line 13, in main
    print('\n'.join([next(f) for _ in range(256)]))
  File "chunking_fizzbuzz.py", line 13, in <listcomp>
    print('\n'.join([next(f) for _ in range(256)]))
KeyboardInterrupt

I'm curious to see how much this result can be improved upon, for Python, and how other dynamic languages perform.

\$\endgroup\$
1
  • 2
    \$\begingroup\$ Welcome to Code Golf and nice first answer! \$\endgroup\$ Commented Nov 11, 2021 at 20:41
9
\$\begingroup\$

Julia (v1.10)

Run with julia fizzbuzz.jl | pv > /dev/null.

I'm getting 8-10 GiB/s throughput.

Notes:

  • This supports at most 16 digits for the line number, which on my machine theoretically takes a day to reach.
  • This uses vmsplice() as popularized by ais523.
  • The buffer and page size count are hardcoded for my own machine, a Dell XPS with a 4 core / 8 threads Core i7-10510U CPU @ 1.80GHz.
const PAGESIZE = 4096
const L2_CACHE_SIZE = 256 * PAGESIZE
const BUFSIZE = L2_CACHE_SIZE ÷ 2

"""
    ShortString("foo")

Represents a string that's short enough to fit entirely in a UInt128.
We take advantage of that by doing arithmetic on the UInt128 for
enumerating the decimal representation of the line numbers.
"""
struct ShortString
  val :: UInt128
  len :: Int
end

ShortString(s::String) = begin
  @assert length(s) <= sizeof(UInt128)
  s_padded = s * "\0" ^ sizeof(UInt128)
  val = unsafe_load(Ptr{UInt128}(pointer(s_padded)))
  ShortString(val, length(s))
end

Base.length(s::ShortString) = s.len

Base.:+(s::ShortString, x::Integer) = ShortString(s.val + x, s.len)
Base.:-(a::ShortString, b::ShortString) = begin
  @assert length(a) == length(b)
  a.val - b.val
end

concat(s::ShortString, a::Char) = begin
  newval = (s.val << 8) | UInt8(a)
  ShortString(newval, s.len + 1)
end

"""
    StaticBuffer(size)

Represents a simple byte array together with its next index.

This struct is non-mutable, and instead of updating `ptr` in place, we
replace it with a new StaticBuffer (see the `put` implementation).
This has experimentally been much faster; I think the compiler can apply
more optimizations when it keeps the struct on the stack.
"""
struct StaticBuffer
  buf :: Vector{UInt8}
  ptr :: Ptr{UInt128}
end

StaticBuffer(size) = begin
  buf = Vector{UInt8}(undef, size)
  ptr = pointer(buf)
  StaticBuffer(buf, ptr)
end

Base.length(buffer::StaticBuffer) = buffer.ptr - pointer(buffer.buf)
Base.pointer(buffer::StaticBuffer) = buffer.ptr
Base.truncate(buffer::StaticBuffer) = StaticBuffer(buffer.buf, pointer(buffer.buf))

put(buffer::StaticBuffer, s::ShortString) = begin
  unsafe_store!(buffer.ptr, s.val)
  StaticBuffer(buffer.buf, buffer.ptr + s.len)
end

almostfull(buffer::StaticBuffer) = begin
  length(buffer.buf) - (buffer.ptr - pointer(buffer.buf)) < PAGESIZE
end

"""
    withpipefd(f, io::IO, args...; kwds...)

Run `f` with a file descriptor (`::RawFD`) that is known to be a pipe; if `io`
isn't a pipe already, we insert a dummy `cat` process. This allows us to use
`vmsplice` which is much faster in the benchmark setup than `write`.
"""
withpipefd(f, io::Base.PipeEndpoint, args...; kwds...) = f(Base._fd(io), args...; kwds...)
withpipefd(f, io::Base.IOContext, args...; kwds...) = withpipefd(f, io.io, args...; kwds...)
withpipefd(f, io, args...; kwds...) = begin
  process = open(pipeline(`cat`, stdout=io), write=true)
  withpipefd(f, process.in, args...; kwds...)
  close(process)
end

"""
    vmsplice(fdesc, buffer)

Splice the data in `buffer` to the pipe in `fdesc`.
"""
vmsplice(fdesc::RawFD, buffer::StaticBuffer) = begin
  ptr = pointer(buffer.buf)
  while ptr < buffer.ptr
    written = @ccall vmsplice(
      fdesc :: Cint,
      (ptr, buffer.ptr - ptr) :: Ref{Tuple{Ref{UInt8}, Csize_t}},
      1 :: Csize_t,
      0 :: Cuint) :: Cssize_t
    if written < 0
      error("Couldn't write to pipe")
    end
    ptr += written
  end
end

"""
    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)

Move asciidigits and intdigits to the next line, i.e. add 1
to the ascii and decimal representations.
"""
@inline nextline(asciidigits, intdigits, plusone) = begin
  asciidigits += plusone
  intdigits = Base.setindex(intdigits, intdigits[1] + 1, 1)
  asciidigits, intdigits
end

const CARRY = ShortString("20") - ShortString("1:")

"""
    asciidigits, plusone, pluscarry = carry(position, asciidigits, plusone, pluscarry)

Perform a carry operation on asciidigits in the `position`th decimal place.
"""
@inline carry(position, asciidigits, plusone, pluscarry) = begin
  if position + 1 == length(asciidigits)
    asciidigits = concat(asciidigits, '0')

    plusone <<= 8
    pluscarry = pluscarry .<< 8
    pluscarry = Base.setindex(pluscarry, CARRY, position)
  end
  asciidigits += pluscarry[position]
  asciidigits, plusone, pluscarry
end

"""
    @compiletime for a in b
      <statements>
    end

Unroll the loop.
"""
macro compiletime(forloop)
  @assert forloop.head == :for
  it, body = forloop.args
  @assert it.head == :(=)
  lhs, rhs = it.args

  expressions = gensym(:expressions)

  body = quote
    push!($expressions, $(Expr(:quote, body)))
  end

  expressions = Core.eval(__module__, quote
    let $expressions = []
      for $lhs in $rhs
        $body
      end
      $expressions
    end
  end)

  return esc(quote
    $(expressions...)
  end)
end

"""
    asciidigits, intdigits, plusone, pluscarry = maybecarry(asciidigits, intdigits, plusone, pluscarry)

If necessary, perform a carry operation on asciidigits and intdigits.
"""
@inline maybecarry(asciidigits, intdigits, plusone, pluscarry) = begin
  asciidigits += plusone

  @compiletime for d in 1:16
    intdigits = Base.setindex(intdigits, intdigits[$d] + 1, $d)
    intdigits[$d] != 10 && @goto carried
    intdigits = Base.setindex(intdigits, 0, $d)
    asciidigits, plusone, pluscarry = carry($d, asciidigits, plusone, pluscarry)
  end

  intdigits = Base.setindex(intdigits, intdigits[17] + 1, 17)
  intdigits[17] >= 10 && error("too big!")

  @label carried
  asciidigits, intdigits, plusone, pluscarry
end

const FIZZ = ShortString("Fizz\n")
const BUZZ = ShortString("Buzz\n")
const FIZZBUZZ = ShortString("FizzBuzz\n")

initialstate() = (
  intdigits = (1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),
  asciidigits = ShortString("1\n"),
  plusone = UInt128(1),
  pluscarry = ntuple(_ -> zero(UInt128), Val(sizeof(UInt128)))
)

fizzbuzz(buffer::StaticBuffer, state) = begin
  (;intdigits, asciidigits, plusone, pluscarry) = state

  while !almostfull(buffer)
    buffer = put(buffer, asciidigits)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
    buffer = put(buffer, asciidigits)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
    buffer = put(buffer, FIZZ)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
    buffer = put(buffer, asciidigits)

    asciidigits, intdigits, plusone, pluscarry = maybecarry(asciidigits, intdigits, plusone, pluscarry)
    buffer = put(buffer, BUZZ)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
    buffer = put(buffer, FIZZ)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
    buffer = put(buffer, asciidigits)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
    buffer = put(buffer, asciidigits)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
    buffer = put(buffer, FIZZ)

    asciidigits, intdigits, plusone, pluscarry = maybecarry(asciidigits, intdigits, plusone, pluscarry)
    buffer = put(buffer, BUZZ)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
    buffer = put(buffer, asciidigits)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
    buffer = put(buffer, FIZZ)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
    buffer = put(buffer, asciidigits)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
    buffer = put(buffer, asciidigits)

    asciidigits, intdigits, plusone, pluscarry = maybecarry(asciidigits, intdigits, plusone, pluscarry)
    buffer = put(buffer, FIZZBUZZ)

    asciidigits, intdigits = nextline(asciidigits, intdigits, plusone)
  end
  buffer, (;intdigits,asciidigits,plusone,pluscarry)
end

fizzbuzz(fdesc::RawFD, cutoff=typemax(Int)) = begin
  pipesize = @ccall fcntl(fdesc::Cint, 1031::Cint, BUFSIZE::Cint)::Cint
  @assert pipesize == BUFSIZE

  buf1, buf2 = StaticBuffer(BUFSIZE), StaticBuffer(BUFSIZE)

  state = initialstate()
  n = 0

  @GC.preserve buf1 buf2 while n <= cutoff

    buf1, state = fizzbuzz(truncate(buf1), state)
    vmsplice(fdesc, buf1)
    n += length(buf1)


    buf2, state = fizzbuzz(truncate(buf2), state)
    vmsplice(fdesc, buf2)
    n += length(buf2)
  end
end

"""
  fizzbuzz(io::IO, cutoff=typemax(Int))

Write the fizzbuzz output to `io`.

The `cutoff` parameter is approximate; depending on buffering, more bytes
may be written to `io`.
"""
fizzbuzz(io::IO, cutoff=typemax(Int)) = withpipefd(fizzbuzz, io, cutoff)

if abspath(PROGRAM_FILE) == @__FILE__
  fizzbuzz(stdout)
end
\$\endgroup\$
1
  • 1
    \$\begingroup\$ Welcome to Code Golf, and nice answer! \$\endgroup\$ Commented Dec 28, 2022 at 16:35
8
\$\begingroup\$

Update

A minor change to remove a branch, plus some code cleanup: about a 10% speedup on my machine.

Because the System V calling convention treats all xmm registers as clobbered across function calls, including the call to the vmsplice wrapper function, I had to make a manual syscall to vmsplice with inline assembly to tell the compiler that no xmm registers are clobbered, so that it can keep more values preloaded in xmm registers. This was the main source of the speedup. Let me know if there's a better way.

The inline assembly part is now commented out because Omer (who opened the challenge) thinks inline assembly makes the code no longer pure C, and it's best for me to compete with the other C entries.


This program will run faster than any other C or C++ entry, at least twice the speed. It is also noticeably faster than the last edit, which was a bit buggy (blame vmsplice). gcc as the compiler and SSE4.1 support are requirements.

There are two main optimizations that make this program run fast.

First, the digits are stored in a single 128-bit XMM register; since each character consumes 8 bits, one register can hold at most 16 digits, which is more than enough. All operations on the digits are done with one or a few SIMD instructions, and the data stays in a register until it is copied to the buffer.

The PSHUFB instruction, i.e. _mm_shuffle_epi8, is the main source of speedup. It performs the otherwise complicated byte reversal, and the shifting of all bytes in one direction, each in a single instruction. Both happen in the writeDigits function.

Second is vmsplice. I do not like whoever designed this Linux-specific system call, and probably the same person wrote its documentation, which is very bad. Everything would have been simpler if such a poorly documented and unpredictable syscall didn't exist at all. I was really forced to use it because the speedup it could provide was too big. If you ever consider using this syscall, the following notes may help.

  • The safest and probably intended use of vmsplice is the sequence mmap -> vmsplice with SPLICE_F_GIFT -> munmap. You can call munmap directly after vmsplice returns. This works in a similar way to asynchronous I/O, and is very efficient in that sense. However, since you are mmaping a new buffer every time, the buffer is unlikely to be in cache.
  • If you overwrite the buffer after a call to vmsplice, things can get very unpredictable. Whether SPLICE_F_GIFT is set or not makes no difference, and I'm not even sure that flag does anything at all. The fact that vmsplice has returned does not mean that the pipe has consumed all of the buffer; that's why I said it is similar to asynchronous I/O. It is not documented at all when it becomes safe to overwrite the buffer after vmsplice has returned. All I can say is that it is somewhat predictable when,
    1. vmsplice is always called with the same buffer size.
    2. The pipe size matches the buffer size.
    3. The same thread writes to the buffer and calls vmsplice.
    4. The buffer is not overwritten too fast... and I don't know how much is too fast, but for example if you overwrite the buffer right after vmsplice returns, things get unpredictable.

Compile the program with gcc -s -O3 -msse4.1 -fwhole-program.

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdint.h>
#include <stdalign.h>
#include <string.h>
#include <fcntl.h>
#include <smmintrin.h>

static const __m128i shiftMask[] = {
  {0x0706050403020100, 0x0f0e0d0c0b0a0908},
  {0x0807060504030201, 0xff0f0e0d0c0b0a09},
  {0x0908070605040302, 0xffff0f0e0d0c0b0a},
  {0x0a09080706050403, 0xffffff0f0e0d0c0b},
  {0x0b0a090807060504, 0xffffffff0f0e0d0c},
  {0x0c0b0a0908070605, 0xffffffffff0f0e0d},
  {0x0d0c0b0a09080706, 0xffffffffffff0f0e},
  {0x0e0d0c0b0a090807, 0xffffffffffffff0f},
  {0x0f0e0d0c0b0a0908, 0xffffffffffffffff},
  {0xff0f0e0d0c0b0a09, 0xffffffffffffffff},
  {0xffff0f0e0d0c0b0a, 0xffffffffffffffff},
  {0xffffff0f0e0d0c0b, 0xffffffffffffffff},
  {0xffffffff0f0e0d0c, 0xffffffffffffffff},
  {0xffffffffff0f0e0d, 0xffffffffffffffff},
  {0xffffffffffff0f0e, 0xffffffffffffffff},
  {0xffffffffffffff0f, 0xffffffffffffffff}
};

static __m128i inc(__m128i d) {
  return _mm_sub_epi64(d, _mm_set_epi64x(0, -1));
}

static __m128i carry(__m128i d) {
  d = _mm_sub_epi64(d,
    _mm_bslli_si128(_mm_cmpeq_epi64(d, _mm_setzero_si128()), 8));
  return _mm_or_si128(d,
    _mm_and_si128(_mm_cmpeq_epi8(d, _mm_setzero_si128()), _mm_set1_epi8(0xf6)));
}

static int writeDigits(char *b, __m128i d, int i) {
  _mm_storeu_si128((__m128i *)b,
    _mm_shuffle_epi8(
      _mm_shuffle_epi8(_mm_sub_epi64(d, _mm_set1_epi8(0xc6)),
        _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)),
      shiftMask[i]));
  b[16 - i] = '\n';
  return 17 - i;
}

static int writeFizz(void *b) {
  memcpy(b, &(int64_t){0x0a7a7a6946}, 8);
  return 5;
}

static int writeFizzBuzz(void *b) {
  memcpy(b, &(int64_t){0x7a7a75427a7a6946}, 8);
  ((char *)b)[8] = '\n';
  return 9;
}

static int writeFizzAndBuzz(void *b) {
  memcpy(b, &(int64_t){0x7a75420a7a7a6946}, 8);
  memcpy(b + 8, &(int16_t){0x0a7a}, 2);
  return 10;
}

static int writeBuzzAndFizz(void *b) {
  memcpy(b, &(int64_t){0x7a69460a7a7a7542}, 8);
  memcpy(b + 8, &(int16_t){0x0a7a}, 2);
  return 10;
}

static void memcpy_simple(void *d, void *s, int n) {
  int i = 0;
  do {
    _mm_store_si128((void *)((char *)d + i),
      _mm_load_si128((void *)((char *)s + i)));
  } while ((i += 16) < n);
}

#define ALIGN 0x1000
#define SIZE (ALIGN << 8)
#define I d = inc(d)
#define IC d = carry(inc(d))
#define D I; p += writeDigits(p, d, i)
#define F I; p += writeFizz(p)
#define FB I; p += writeFizzBuzz(p)
#define FNB I; I; p += writeFizzAndBuzz(p)
#define BNF I; I; p += writeBuzzAndFizz(p)

int main() {
  if (fcntl(1, F_SETPIPE_SZ, SIZE) != SIZE) {
    abort();
  }
  alignas(ALIGN) char b[2][SIZE + ALIGN];
  int f = 0;
  char *p = b[f];
  __m128i d = _mm_set1_epi8(0xf6);
  int i = 15;
  for (int64_t j = 10, k = 10;; j += 30) {
    D; D; F; D; BNF; D; D; I; IC; p += writeFizzAndBuzz(p);
    if (j == k) {
      k *= 10;
      --i;
    }
    D; F; D; D; FB; D; D; F; D; IC; p += writeBuzzAndFizz(p);
    I; D; D; FNB; D; F; D; D; IC; p += writeFizzBuzz(p);
    int n = p - b[f] - SIZE;
    if (n >= 0) {
      struct iovec v = {b[f], SIZE};
      do {
        /*register long rax __asm__ ("rax") = 278;
        register long rdi __asm__ ("rdi") = 1;
        register long rsi __asm__ ("rsi") = (long)&v;
        register long rdx __asm__ ("rdx") = 1;
        register long r10 __asm__ ("r10") = 0;
        __asm__ ("syscall" : "+r"(rax) : "r"(rdi), "r"(rsi), "r"(rdx), "r"(r10)
        : "rcx", "r11");*/
        long rax = vmsplice(1, &v, 1, 0);
        if (rax < 0) {
          abort();
        }
        v.iov_base = (char *)v.iov_base + rax;
        v.iov_len -= rax;
      } while (v.iov_len);
      f = !f;
      memcpy_simple(b[f], b[!f] + SIZE, n);
      p = b[f] + n;
    }
  }
  return 0;
}
\$\endgroup\$
8
  • \$\begingroup\$ "I'm looking for a good way to test whether my program's output is correct" - See github.com/omertuc/fizzgolf/tree/main#verifying-submission \$\endgroup\$ Commented Dec 23, 2021 at 16:36
  • \$\begingroup\$ @OmerTuchfeld Thanks for providing a testing module. I also made a pull request in github with the updated code. \$\endgroup\$
    – xiver77
    Commented Dec 24, 2021 at 3:51
    \$\begingroup\$ @OmerTuchfeld I couldn't run the test with the updated code. Now it needs to be piped to pv or cat like the asm entries. I tried adding taskset to the Makefile, but still couldn't run the test. The code runs fine on its own, but most certainly I'm doing something wrong using the test module. \$\endgroup\$
    – xiver77
    Commented Dec 24, 2021 at 4:19
  • \$\begingroup\$ I believe the taskset commands are only there to improve performance, not to improve vmsplice compatibility. Not entirely sure why the test doesn't work for you because it works perfectly fine with the 2 asm entries which also use vmsplice. I'll have to look a bit into how vmsplice works then either fix your submission or fix the tester to make them compatible \$\endgroup\$ Commented Dec 24, 2021 at 12:07
  • \$\begingroup\$ Okay seems like the tester not working on your newer version has nothing to do with vmsplice - the issue is that the tester is searching for the bytes \n1\n to detect the beginning of the fizzbuzz output (as opposed to junk compiler / make output) but your newer version starts printing numbers from 174871 rather than 1, so it never finds that string and ends up looping forever on it \$\endgroup\$ Commented Dec 24, 2021 at 14:56
6
\$\begingroup\$

My code works on Windows 10. It outputs 8-9 GiB/s when the CPU is cool enough.

I used the following ideas in my code:

  • Filling a 256 KiB buffer and sending it to the output; with a smaller buffer, performance suffers; a bigger buffer sometimes improves performance, but never by much.
  • For numbers with the same number of digits, it works in chunks of 15 output lines; these chunks all have identical length. While enough space remains in the output buffer, it copies the previous chunk and adds 15 to the ASCII representation of each number in it.
  • Near the end of the buffer, and for the first chunk, it computes the output lines explicitly; the same happens when the numbers in a chunk have different lengths (e.g. 9999 and 10000).
  • It uses OpenMP to calculate 4 chunks simultaneously. I set NUM_THREADS to 4 (best on my computer, which has 8 logical cores); a larger setting might be better.

When I want to verify the output, I set check_file = 1 in the code; with check_file = 0, it writes to NUL, the null output device on Windows.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>             /* memcpy */
#include <math.h>
#include <inttypes.h>
#include <assert.h>
#include "windows.h"
#include "fileapi.h"
#include "process.h"

int size_15(int num_digits) // size of 15 messages, where numbers have a given number of digits
{
    return 8 * num_digits + 47;
}

int num_of_digits(int64_t counter) // number of digits
{
    int result = 1;
    if (counter >= 100000000)
    {
        counter /= 100000000;
        result += 8;
    }
    if (counter >= 10000)
    {
        counter /= 10000;
        result += 4;
    }
    if (counter >= 100)
    {
        counter /= 100;
        result += 2;
    }
    if (counter >= 10)
        return result + 1;
    else
        return result;
}

void print_num(char* buf, int64_t counter, int num_digits)
{
    for (int i = 0; i < num_digits; ++i)
    {
        buf[num_digits - 1 - i] = counter % 10 + '0';
        counter /= 10;
    }
}

void add_15_to_decimal_num(char* p, int num_digits)
{
    char digit = p[num_digits - 1] + 5;
    int c = (digit > '9');
    p[num_digits - 1] = (char)(digit - c * 10);
    c += 1;
    for (int i = 1; i < num_digits; ++i)
    {
        if (c == 0)
            break;
        digit = (char)(p[num_digits - 1 - i] + c);
        c = digit > '9';
        p[num_digits - 1 - i] = (char)(digit - c * 10);
    }
}

uint64_t fill_general(char* buf, int size, uint64_t counter, int* excess)
{
    char* p = buf;
    while (p < buf + size)
    {
        int fizz = counter % 3 == 0;
        int buzz = counter % 5 == 0;
        if (fizz && buzz)
        {
            memcpy(p, "FizzBuzz\n", 9);
            p += 9;
        }
        else if (fizz)
        {
            memcpy(p, "Fizz\n", 5);
            p += 5;
        }
        else if (buzz)
        {
            memcpy(p, "Buzz\n", 5);
            p += 5;
        }
        else
        {
            int num_digits = num_of_digits(counter);
            print_num(p, counter, num_digits);
            p[num_digits] = '\n';
            p += num_digits + 1;
        }
        ++counter;
    }
    *excess = (int)(p - (buf + size));
    return counter;
}

void fill15(char* buf, int64_t counter, int num_digits, int num_ofs[8])
{
    char* p = buf;
    int m15 = counter % 15;
    for (int i = m15; i < m15 + 15; ++i)
    {
        if (i % 15 == 0)
        {
            memcpy(p, "FizzBuzz\n", 9);
            p += 9;
        }
        else if (i % 3 == 0)
        {
            memcpy(p, "Fizz\n", 5);
            p += 5;
        }
        else if (i % 5 == 0)
        {
            memcpy(p, "Buzz\n", 5);
            p += 5;
        }
        else
        {
            *num_ofs++ = (int)(p - buf);
            print_num(p, counter + i - m15, num_digits);
            p += num_digits;
            *p++ = '\n';
        }
    }
}

// memcpy replacement; works only for sizes equal to 47 + 8 * n, for small n
void copy_47_8n(char* src, unsigned size)
{
    char* dst = src + size;

    memcpy(dst, src, 47);
    size -= 47;
    dst += 47;
    src += 47;

    if (size >= 128)
        exit(1);
    if (size >= 96)
        memcpy(dst + 64, src + 64, 32);
    if (size >= 64)
        memcpy(dst + 32, src + 32, 32);
    if (size >= 32)
        memcpy(dst + 0, src + 0, 32);
    dst += size / 32 * 32;
    src += size / 32 * 32;
    size %= 32;
    if (size >= 24)
        memcpy(dst + 16, src + 16, 8);
    if (size >= 16)
        memcpy(dst + 8, src + 8, 8);
    if (size >= 8)
        memcpy(dst + 0, src + 0, 8);
}

#define NUM_THREADS 4

uint64_t fill_fast(char* buf, int size, uint64_t counter, int* excess)
{
    const int num_digits = num_of_digits(counter);
    const int chunk_size = 8 * num_digits + 47;
    const int num_iter = size / chunk_size;
    int thread;
#pragma omp parallel for
    for (thread = 0; thread < NUM_THREADS; ++thread)
    {
        const int begin_iter = num_iter * thread / NUM_THREADS;
        const int thread_num_iter = num_iter * (thread + 1) / NUM_THREADS - begin_iter;
        char* output = buf + begin_iter * chunk_size;
        int num_ofs[8];
        fill15(output, counter + begin_iter, num_digits, num_ofs);
        for (int iter = 1; iter < thread_num_iter; ++iter)
        {
            copy_47_8n(output, chunk_size);
            for (int i = 0; i < 8; ++i)
                add_15_to_decimal_num(output + chunk_size + num_ofs[i], num_digits);
            output += chunk_size;
        }
    }

    buf += num_iter * chunk_size;
    size -= num_iter * chunk_size;
    counter += num_iter * 15;

    return fill_general(buf, size, counter, excess);
}

uint64_t fill(char* buf, int size, uint64_t counter, int* excess)
{
    int num_digits = num_of_digits(counter);
    int64_t max_next_counter = counter + size / (8 * num_digits + 47) * 15 + 15;
    int max_next_num_digits = num_of_digits(max_next_counter);
    if (num_digits == max_next_num_digits)
        return fill_fast(buf, size, counter, excess);
    else
        return fill_general(buf, size, counter, excess);
}

void file_io(void)
{
    int check_file = 0;
    HANDLE f = CreateFileA(check_file ? "my.txt" : "NUL", GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0);
    DWORD e = GetLastError();
    LARGE_INTEGER frequency;
    QueryPerformanceFrequency(&frequency);

    DWORD read;
    int bufsize = 1 << 18;
    long long statsize = 1ll << 34;
    char* buf = malloc(bufsize);
    uint64_t counter = 1;
    int excess = 0;
    while (counter < 9999999900000000)
    {
        LARGE_INTEGER start, stop;
        QueryPerformanceCounter(&start);
        for (int i = 0; i < statsize / bufsize; ++i)
        {
            memcpy(buf, buf + bufsize, excess);
            counter = fill(buf + excess, bufsize - excess, counter, &excess);
            e = WriteFile(f, buf, bufsize, &read, 0);
            if (check_file)
                FlushFileBuffers(f);
            if (e == 0 || (int)read != bufsize)
            {
                e = GetLastError();
                exit(1);
            }
        }
        QueryPerformanceCounter(&stop);
        double time = (double)(stop.QuadPart - start.QuadPart) / frequency.QuadPart;
        printf("Throughput (GB/s): %f\n", statsize / (1 << 30) / time);
    }

    CloseHandle(f);
    exit(0);
}

int main()
{
    file_io();
}
\$\endgroup\$
4
\$\begingroup\$

My solution maintains a buffer with a batch of lines (6000 lines worked best on my system), and updates all the numbers in the buffer in a parallelisable loop. We use an auxiliary array nl[] to keep track of where each newline lies, so we have random access to all the numbers.
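The nl[] bookkeeping is just an index of line boundaries; the random-access idea can be sketched in a few lines of Python (a simplified indexing convention, not the exact one the C code uses):

```python
# Simplified sketch of the nl[] idea: record where each line ends, so any
# number in the flat buffer can be reached without scanning.
buf = b"1\n2\nFizz\n4\nBuzz\n"
nl = [0] + [k + 1 for k, ch in enumerate(buf) if ch == ord("\n")]

def line(i: int) -> bytes:
    # Random access to line i: slice between consecutive newline offsets.
    return buf[nl[i]:nl[i + 1] - 1]

assert line(0) == b"1" and line(2) == b"Fizz" and line(4) == b"Buzz"
```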

The addition is all in-place decimal character-by-character arithmetic, with no arithmetic division after the buffer is initialised (I could have created the buffer without division, too, but opted for shorter, readable code!). Every so often, when the number of digits rolls over, we have to stop and re-position all the numbers within the buffer (that's what the shuffle counter is for), and update the corresponding entries in nl[]; this happens less and less frequently as we proceed.
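The carry propagation itself is short; here is a Python illustration of the same character-by-character addition (`add_single_digit` is a hypothetical name, mirroring the C inner loop below):

```python
# Hypothetical sketch of the inner loop: add one digit at some position of
# an ASCII decimal number, then ripple the carry leftwards.
def add_single_digit(buf: bytearray, pos: int, digit: int) -> None:
    buf[pos] += digit
    while buf[pos] > ord("9"):
        buf[pos] -= 10
        pos -= 1
        buf[pos] += 1  # carry into the next digit to the left

n = bytearray(b"0994")     # leading zero leaves room for rollover
add_single_digit(n, 3, 6)  # add 6 to the units digit: 994 + 6
assert n == bytearray(b"1000")
```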

I compiled using gcc -std=gnu17 -Wall -Wextra -fopenmp -O3 -march=native, and ran with OMP_NUM_THREADS=3 set in the environment (a different number of threads may be optimal on another host).

#include <stdatomic.h>
#include <stdio.h>              /* sprintf */
#include <string.h>             /* memset */
#include <unistd.h>

/* This is the single tunable you need to adjust for your platform */
#define chunk 6000 /* must be multiple of 3*5, with only one nonzero digit */
/* i.e. 3, 6 or 9 times an exact power of ten */

/* Select a number of digits to use.  If we produce one billion numbers
   per second, then we'll finish all the 18-digit numbers in just 30
   years.  24 digits should suffice until the next geological epoch, at least. */
#define numlen 25               /* 24 decimal digits plus newline */

#define STR_(x) #x
#define STR(x) STR_(x)
#define chunk_str STR(chunk)

#define unlikely(e) __builtin_expect((e), 0)

char format[chunk * numlen];
char *nl[chunk+1];

int main()
{
    /* Create the format string. */
    /* We do this twice, as the numbers written first time round are
       too short for the addition. */
    for (int j = 0, n = 1;  j < 2;  ++j)
    {
        nl[0] = format;
        char *p = format;
        for (int i = 0;  i <= chunk;  ++i, ++n) {
            if ((n % 15) == 0) {
                p += sprintf(p, "FizzBuzz\n");
            } else if ((n % 5) == 0) {
                p += sprintf(p, "Buzz\n");
            } else if ((n % 3) == 0) {
                p += sprintf(p, "Fizz\n");
            } else {
                p += sprintf(p, "%d\n", n);
            }
            nl[i] = p;
        }
        write(1, format, nl[chunk] - format);
    }

    atomic_int shuffle = 0;
    for (;;) {
#pragma omp parallel for schedule(static)
        for (int i = 0;  i < chunk;  ++i) {
            if (nl[i+1][-2] == 'z') {
                /* fizz and/or buzz - not a number */
                continue;
            }
            /* else add 'chunk' to the number */
            static const int units_offset = sizeof chunk_str;
            static const int digit = chunk_str[0] - '0';
            char *p = nl[i+1] - units_offset;
            *p += digit;
            while (*p > '9') {
                *p-- -= 10;
                ++*p;
            }
            if (unlikely(p < nl[i])) {
                /* digit rollover */
                ++shuffle;
            }
        }
        if (unlikely(shuffle)) {
            /* add a leading one to each overflowing number */
            char **nlp = nl + chunk;
            char *p = *nlp;
            char *dest = p + shuffle;
            while (p < dest) {
                if (*p == '\n') {
                    *nlp-- = dest + 1;
                } else if (*p == '\n'+1) {
                    --*p;
                    *dest-- = '1';
                    *nlp-- = dest + 1;
                }
                *dest-- = *p--;
            }
            shuffle = 0;
        }
        write(1, format, nl[chunk] - format);
    }
}
\$\endgroup\$
7
  • \$\begingroup\$ Wouldn't memcpy(p,"Fizz\n",6); p+=5 be faster than sprintf? \$\endgroup\$
    – EasyasPi
    Commented Jan 19, 2021 at 22:05
  • 1
    \$\begingroup\$ Optimal with OMP_NUM_THREADS=1 🤔. @EasyasPi tried that, didn't make a noticeable difference \$\endgroup\$ Commented Jan 19, 2021 at 22:22
  • \$\begingroup\$ @EasyasPi, ITYM memcpy(p,"Fizz\n",5); to avoid pointlessly writing the null character each time? memcpy/sprintf tuning makes no measurable difference, as that's only the setup, outside the main loop. sprint() makes for more maintainable code. (I'm new to programming for all-out speed; in my day job that comes in third, after robustness and maintainability) \$\endgroup\$ Commented Jan 20, 2021 at 7:24
  • \$\begingroup\$ @Omer: yes, the poor parallelisation was a disappointment. I tried some other parallelisations, too (separate array of formatted numbers, and putting the serialisation and writing into its own thread). If I'm to get any real benefit, I might have to go low-level and hand-craft the threads and their synchronisation (probably switch to C++ for that). \$\endgroup\$ Commented Jan 20, 2021 at 7:31
  • \$\begingroup\$ I think I manage slightly better if I have atomic_int shuffle instead of the reduction. (time passes...) Yes, and I've updated to code that actually works faster in parallel, at last! \$\endgroup\$ Commented Jan 20, 2021 at 7:39
4
\$\begingroup\$

Trivial Rust

This one is just plain Rust without any tricks (no unsafe, no vmsplice(), no assembly), just light loop unrolling. It manages to reach 6GiB/s on my laptop (XPS 15 i9) (this figure is wrong; see the comments). I'm curious how much it does on the reference hardware.
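The unrolling works because the Fizz/Buzz pattern repeats every 15 lines, with plain numbers only at offsets 1, 2, 4, 7, 8, 11, 13 and 14. A quick sanity check of that claim (in Python for brevity):

```python
# One 15-line block of FizzBuzz starting just after n (n a multiple of 15).
def block(n: int) -> list[str]:
    out = []
    for k in range(1, 16):
        i = n + k
        out.append("FizzBuzz" if i % 15 == 0 else
                   "Fizz" if i % 3 == 0 else
                   "Buzz" if i % 5 == 0 else str(i))
    return out

# Plain numbers always sit at the same eight offsets, so a fixed format
# string with eight number slots is valid for every block.
offsets = [k for k, s in enumerate(block(0), start=1) if s.isdigit()]
assert offsets == [1, 2, 4, 7, 8, 11, 13, 14]
assert block(15)[14] == "FizzBuzz"
```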

Compile with cargo build --release, run with ./target/release/fizzbuzz | pv >/dev/null

use std::error::Error;
use std::fmt::Write;
use std::io::Write as OtherWrite;

const LEN: usize = 1000000000;

fn main() -> Result<(), Box<dyn Error>> {
    let stdout = std::io::stdout();
    let mut stdout = std::io::BufWriter::new(stdout.lock());
    let mut buffer = String::new();
    buffer.reserve(80);
    let mut n = 0usize;
    while n < LEN {
        write!(
            &mut buffer,
            r#"{}
{}
Fizz
{}
Buzz
Fizz
{}
{}
Fizz
Buzz
{}
Fizz
{}
{}
FizzBuzz
"#,
            n + 1,
            n + 2,
            n + 4,
            n + 7,
            n + 8,
            n + 11,
            n + 13,
            n + 14
        )?;
        stdout.write_all(buffer.as_bytes())?;
        n += 15;
        buffer.clear(); // forgot that ...
    }
    Ok(())
}
\$\endgroup\$
5
  • 1
    \$\begingroup\$ Getting 8.5GiB/s! Great throughput / complexity ratio. That puts its in third place I believe. This person promised 500$ to whoever achieves this - news.ycombinator.com/item?id=29413879 \$\endgroup\$ Commented Dec 5, 2021 at 13:34
  • \$\begingroup\$ Promised to donate to Watsi*, my bad \$\endgroup\$ Commented Dec 5, 2021 at 13:41
  • 1
    \$\begingroup\$ Although I would suggest to first make sure that the output is correct and that the throughput holds up and doesn't decrease after a while in comparison to the C implementation before deciding on which is better, because the throughput is very close \$\endgroup\$ Commented Dec 5, 2021 at 13:45
  • 1
    \$\begingroup\$ No sweat @OmerTuchfeld, I wasn't in there for the money anyway. Thanks a lot for the test, it would be nice to be on the podium for such a simple code. \$\endgroup\$ Commented Dec 5, 2021 at 13:56
  • 3
    \$\begingroup\$ Unfortunately it's wrong. I forgot to clear the buffer, and now that I did it it kills the throughtput (I edited the code). \$\endgroup\$ Commented Dec 5, 2021 at 14:20
4
\$\begingroup\$

Python

with numba (original)

import numpy as np
from numba import njit

@njit
def get_arr(i,j):
    a = np.arange(i,j)
    return a[np.bitwise_and(a%5 != 0, a%3 != 0)]

def fizzbuzzv3(chunk,length):
    string = ""
    for i in range(1,chunk+1):
        if i%15 == 0:
            string += "FizzBuzz"
        elif i%3 == 0:
            string += "Fizz"
        elif i%5 == 0:
            string += "Buzz"
        else:
            string += "{}"
        string += "\n"
    string = string[:-1]  # drop only the trailing newline; print() adds it back
    for i in range(1,length,chunk):
        print(string.format(*get_arr(i,i+chunk)))
        
fizzbuzzv3(6000,int(1e100))

Tested on Google Colaboratory with int(1e9) instead of int(1e100) for practical reasons and got 38.3MiB/s.

!python3 test.py | pv > /dev/null
7.33GiB 0:03:16 [38.3MiB/s] [    <=>                                           ]

Faster version using os.write and % string formatting, with numba removed

import numpy as np
import os
def fizzbuzz(chunk,length):
    string_arr = np.empty(chunk).astype('<U8')
    string_arr[:] = '%d'
    string_arr[::3] = 'Fizz'
    string_arr[::5] = 'Buzz'
    string_arr[::15] = 'FizzBuzz'
    string = '\n'.join(string_arr) + '\n'
    offset_arr = np.arange(chunk)
    offset_arr = (offset_arr%5 != 0)&(offset_arr%3 != 0)
    offset_arr = np.where(offset_arr)[0]
    for i in range(0,length,chunk):
        to_output = string%tuple(offset_arr.tolist())
        os.write(1,to_output.encode())
        offset_arr += chunk
fizzbuzz(8100,int(1e100))

Google Colaboratory with int(1e9) - 125MiB/s

!python3 test.py | pv > /dev/null
7.33GiB 0:00:59 [ 125MiB/s] [                                     <=>          ]

Pure python, no imports

def fizzbuzz(chunk,length):
    fb_string = ""
    for i in range(0,chunk):
        if i%15 == 0: fb_string += "FizzBuzz"
        elif i%3 == 0: fb_string += "Fizz"
        elif i%5 == 0: fb_string += "Buzz"
        else: fb_string += "%i"
        fb_string += "\n"
    offset_tuple = tuple(i for i in range(chunk) if i%3 != 0 and i%5 != 0)
    for i in range(0,length,chunk):
        print(fb_string % offset_tuple, end='')
        offset_tuple = tuple(i + chunk for i in offset_tuple)
fizzbuzz(6000,int(1e100))

Google Colaboratory with int(1e9) - 87.5MiB/s

!python3 test.py | pv > /dev/null
7.33GiB 0:01:25 [87.5MiB/s] [             <=>                                  ]

Multiprocessing + numpy Version

Requires Python 3.9 and uses shared-memory objects to get output from return_string() on as many cores as possible.
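The shared-memory handoff can be exercised in isolation; a minimal round-trip (attached here within one process for brevity, whereas the workers attach by name from their own processes):

```python
from multiprocessing import shared_memory

# Minimal round-trip: place bytes in shared memory under a name, reattach
# by that name (as a worker process would), and read the bytes back.
payload = b"Fizz\nBuzz\n"
owner = shared_memory.SharedMemory(create=True, size=len(payload))
owner.buf[:len(payload)] = payload

attached = shared_memory.SharedMemory(name=owner.name)  # what a worker does
roundtrip = bytes(attached.buf[:len(payload)])
attached.close()

owner.close()
owner.unlink()
assert roundtrip == payload
```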

import numpy as np
from os import write
from multiprocessing import shared_memory
from multiprocessing import Pool

def return_string(stringmemory_name,arraymemory_name,offset_arr_shape,offset):
    #recover string from shared bytes
    stringmemory = shared_memory.SharedMemory(name=stringmemory_name)
    fb_string = bytes(stringmemory.buf).decode()
    stringmemory.close()

    #recover numpy array from shared bytes
    arraymemory = shared_memory.SharedMemory(name=arraymemory_name)
    offset_arr = np.ndarray(offset_arr_shape, dtype=np.int64, buffer=arraymemory.buf) + offset
    arraymemory.close()

    #Get output
    to_output = fb_string % tuple(offset_arr.tolist())
    
    #Return encoded
    return to_output.encode()

def fizzbuzz(chunk,length,number_processes):

    #Make the string. Uses numpy arrays because it's easy
    string_arr = np.empty(chunk).astype('<U8')
    string_arr[:] = '%d'
    string_arr[::3] = 'Fizz'
    string_arr[::5] = 'Buzz'
    string_arr[::15] = 'FizzBuzz'
    fb_string = '\n'.join(string_arr.tolist()) + '\n'

    #Convert string to bytes and put it into shared memory
    fb_string = fb_string.encode()
    stringmemory = shared_memory.SharedMemory(create=True, size=len(fb_string))
    stringmemory.buf[:len(fb_string)] = fb_string

    #Make the offset array
    offset_arr = np.arange(chunk)
    offset_arr = (offset_arr%5 != 0)&(offset_arr%3 != 0)
    offset_arr = np.where(offset_arr)[0]

    #Put array into shared memory
    arraymemory = shared_memory.SharedMemory(create=True, size=offset_arr.nbytes)
    temp = np.ndarray(offset_arr.shape, dtype=offset_arr.dtype, buffer=arraymemory.buf)
    temp[:] = offset_arr[:]

    #Go over chunks
    with Pool(processes=number_processes) as pool:
        running_list = []
        for i in range(0,length,chunk):
            #Do not exceed number_processes
            if len(running_list) >= number_processes:
                running_list[0].wait()
            #Call a new function
            async_instance = pool.apply_async(return_string, \
            (stringmemory.name,arraymemory.name,offset_arr.shape,i))
            running_list.append(async_instance)
            #output
            if running_list[0].ready():
                write(1,running_list[0].get())
                del running_list[0]

        while len(running_list) != 0:
            running_list[0].wait()
            if running_list[0].ready():
                write(1,running_list[0].get())
                del running_list[0]

    stringmemory.close()
    stringmemory.unlink()
    arraymemory.close()
    arraymemory.unlink()

fizzbuzz(750000,int(1e100),32)

Google Colab with fizzbuzz(750000,int(1e9),8) was only 77.7 MiB/s, but that machine only has 2 cores. On a 16C/32T CPU I think it should do much better. If it isn't too much trouble, please try out different values for the chunk size (first argument - 750000) and the number of processes (last argument - 32).

!python3 test.py | pv > /dev/null
7.34GiB 0:01:36 [77.7MiB/s] [  <=>                                             ]

Multicore, numpy and improvised locks

import numpy as np
from os import write
from multiprocessing import shared_memory
from multiprocessing import Pool

def return_string(stringmemory_name,arraymemory_name,offset_arr_shape,offset,process_id,lock_name):
    #recover string from shared bytes
    stringmemory = shared_memory.SharedMemory(name=stringmemory_name)
    fb_string = bytes(stringmemory.buf).decode()
    stringmemory.close()

    #recover numpy array from shared bytes
    arraymemory = shared_memory.SharedMemory(name=arraymemory_name)
    offset_arr = np.ndarray(offset_arr_shape, dtype=np.int64, buffer=arraymemory.buf) + offset
    arraymemory.close()

    #Get output
    to_output = fb_string % tuple(offset_arr.tolist())
    to_output = to_output.encode()

    #lock 
    lock = shared_memory.SharedMemory(name=lock_name)
    while lock.buf[0:] != process_id:
        pass
    lock.close()

    write(1,to_output)
    return int(process_id.decode()) + 1

def fizzbuzz(chunk,length,number_processes):

    #Make the string. Uses numpy arrays because it's easy
    string_arr = np.empty(chunk).astype('<U8')
    string_arr[:] = '%d'
    string_arr[::3] = 'Fizz'
    string_arr[::5] = 'Buzz'
    string_arr[::15] = 'FizzBuzz'
    fb_string = '\n'.join(string_arr.tolist()) + '\n'

    #Convert string to bytes and put it into shared memory
    fb_string = fb_string.encode()
    stringmemory = shared_memory.SharedMemory(create=True, size=len(fb_string))
    stringmemory.buf[:len(fb_string)] = fb_string

    #Make the offset array
    offset_arr = np.arange(chunk)
    offset_arr = (offset_arr%5 != 0)&(offset_arr%3 != 0)
    offset_arr = np.where(offset_arr)[0]

    #Put array into shared memory
    arraymemory = shared_memory.SharedMemory(create=True, size=offset_arr.nbytes)
    temp = np.ndarray(offset_arr.shape, dtype=offset_arr.dtype, buffer=arraymemory.buf)
    temp[:] = offset_arr[:]

    #Improvised Lock
    lock = shared_memory.SharedMemory(create=True, size=1)
    lock.buf[0:] = '0'.encode()

    #Go over chunks
    with Pool(processes=number_processes) as pool:
        running_list = []
        for i in range(0,length,chunk):
            #Do not exceed number_processes
            if len(running_list) >= number_processes:
                running_list[0].wait()
            #Call a new function
            async_instance = pool.apply_async(return_string, \
            (stringmemory.name, arraymemory.name, offset_arr.shape, i, \
            str((i//chunk)%number_processes).encode(), lock.name))
            running_list.append(async_instance)
            #output
            if running_list[0].ready():
                lock.buf[0:] = str(running_list[0].get()%number_processes).encode()
                del running_list[0]
        #overflow
        while len(running_list) != 0:
            running_list[0].wait()
            if running_list[0].ready():
                lock.buf[0:] = str(running_list[0].get()%number_processes).encode()
                del running_list[0]

    stringmemory.close()
    stringmemory.unlink()
    arraymemory.close()
    arraymemory.unlink()
    lock.close()
    lock.unlink()

fizzbuzz(1500000,int(1e100),8)

On Google Colab (with int(1e9) as usual) this is 15% faster, which makes sense because it doesn't need to pickle the output and send it back to the main program. The improvised lock (a single byte of shared data that holds the printing order) should allow it to print in the correct order despite being async. It also uses a larger chunk size.
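The turn-taking idea behind the improvised lock can be sketched with threads instead of processes (a hypothetical, simplified version; the real code above spins on a shared-memory byte across processes):

```python
import threading

# A shared "turn" counter plays the role of the one-byte shared-memory lock:
# each worker busy-waits until it is its turn, then hands the turn onward.
turn = 0
results = []

def worker(my_id: int, total: int) -> None:
    global turn
    while turn != my_id:   # spin until it is this worker's turn
        pass
    results.append(my_id)  # stand-in for write(1, to_output)
    turn = (my_id + 1) % total

threads = [threading.Thread(target=worker, args=(i, 4)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert results == [0, 1, 2, 3]  # output order is enforced despite concurrency
```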

!python3 fizzbuzz_multiprocessing_numpy_os.py | pv > /dev/null
7.34GiB 0:01:14 [ 100MiB/s] [                        <=>                       ]
\$\endgroup\$
10
  • 1
    \$\begingroup\$ Welcome to Code Golf, and nice answer! \$\endgroup\$ Commented Apr 15, 2022 at 0:30
  • 2
    \$\begingroup\$ Getting 70mb/s. Added new numba category because calling JIT'ed numpy "Python" would be a bit unfair. I think that little step at the beginning of the plot for your submission shows the JIT doing an optimization to gain some more throughput \$\endgroup\$ Commented Apr 16, 2022 at 11:13
  • \$\begingroup\$ @OmerTuchfeld could you try the new version? \$\endgroup\$
    – arrmansa
    Commented Apr 22, 2022 at 9:18
  • \$\begingroup\$ Seems to have matched the pypy submission for the best Python-ish submission at 300MB/s \$\endgroup\$ Commented Apr 22, 2022 at 12:42
  • \$\begingroup\$ Added a pure python version, this should be the last code from me I think. Thanks for running the benchmarks @OmerTuchfeld. \$\endgroup\$
    – arrmansa
    Commented Apr 23, 2022 at 20:09
4
\$\begingroup\$

Python3

I ported Neil's and Kamila's answers to Python 3.

from os import write

buf = bytearray(256)
out = bytearray(65536 + 4096)

init = b"1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\n"

out[:len(init)] = init

fmt = "Buzz\n{}1\nFizz\n{}3\n{}4\nFizzBuzz\n{}6\n{}7\nFizz\n{}9\nBuzz\nFizz\n{}2\n{}3\nFizz\nBuzz\n{}6\nFizz\n{}8\n{}9\nFizzBuzz\n{}1\n{}2\nFizz\n{}4\nBuzz\nFizz\n{}7\n{}8\nFizz\n"

t = 30
i = 1
j = 1

for l in range(1, 20):
    txt = fmt.format(i, i, i, i, i, i, i + 1, i + 1, i + 1, i + 1, i + 1, i + 2, i + 2, i + 2, i + 2, i + 2).encode()
    buf[:len(txt)] = txt

    i *= 10
    while j < i:
        out[t:t+len(txt)] = buf[:len(txt)]
        t += len(txt)
        if t >= 65536:
            u = write(1, out[:65536])
            while u < 65536:
                u += write(1, out[u:65536])

            t -= 65536
            out[:t] = out[65536:65536+t]

        q = 0
        for z in (4, 7, 2, 11, 2, 7, 12, 2, 12, 7, 2, 11, 2, 7, 12, 2):
            q += z + l
            p = q
            if buf[p] < 55:
                buf[p] += 3
            else:
                buf[p] -= 7
                p -= 1
                while buf[p] == 57:
                    buf[p] = 48
                    p -= 1

                buf[p] += 1

        j += 3

On my laptop with an i5-8300H, python3 fizzbuzz.py | pv > /dev/null results in a throughput of 48 MiB/s on average, while running the same code with pypy3 gives me up to 820 MiB/s.
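The branch on buf[p] < 55 in the inner loop is an ASCII trick: digits '0' through '6' (below 55, which is '7') can take +3 directly, while '7' through '9' wrap around with a -7 and carry into the next digit. A quick Python check of that identity (`add3` is a hypothetical name):

```python
# Adding 3 to one ASCII digit: below ord('7') == 55 just add 3; from '7'
# on, subtract 7 (the wrap-around) and carry 1 into the digit to the left.
def add3(c: int) -> tuple[int, int]:
    return (c + 3, 0) if c < 55 else (c - 7, 1)

for d in range(10):
    nc, carry = add3(0x30 + d)
    assert nc - 0x30 == (d + 3) % 10 and carry == (d + 3) // 10
```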

\$\endgroup\$
1
  • 2
    \$\begingroup\$ Welcome to Code Golf, and nice first answer! \$\endgroup\$ Commented Dec 2, 2021 at 20:35
4
\$\begingroup\$

Here's my attempt with Node.js 18.4

Initial solution

const MAX = Number.MAX_SAFE_INTEGER
const BUFFER_SIZE = 32 * 1024

function fizzbuzz() {
    let buffer = ''

    for (let i = 0; i < MAX; i += 15) {
        buffer += `${i + 1}\n${i + 2}\nFizz\n${i + 4}\nBuzz\nFizz\n${i + 7}\n${i + 8}\nFizz\nBuzz\n${i + 11}\nFizz\n${i + 13}\n${i + 14}\nFizzBuzz\n`

        if (buffer.length > BUFFER_SIZE) {
            process.stdout.write(buffer)
            buffer = ''
        }
    }
}

fizzbuzz()

This runs at ~110MB/s

Sadly this doesn't satisfy the requirement for large values, as the largest safe integer in JavaScript is 2^53-1. It also quickly hits OOM errors because, for whatever reason, the garbage collector doesn't free up the temporary strings.
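That 2^53-1 limit comes from JavaScript numbers being IEEE-754 doubles; the same behaviour shows up in any language with doubles (Python floats here):

```python
# Above 2**53, IEEE-754 doubles can no longer represent every integer, so
# incrementing line numbers would silently repeat or skip values.
assert float(2**53 - 1) + 1.0 == float(2**53)  # still exact at the boundary
assert float(2**53) + 1.0 == float(2**53)      # 2**53 + 1 rounds back down
```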

Single threaded

const MAX = 0x7FFFFFFFFFFFFFFFn
const BUFFER_SIZE = 64 * 1024

const MAX_DECIMALS = 19 // 19 max bytes (decimal places) for 2^63-1
const BASE10 = [...Array(MAX_DECIMALS + 1).keys()].map((i) => 10n ** BigInt(i))

const FIZZ_BUZZ_PER_CYCLE = (15 / 3) + (15 / 5)
const INT_PER_CYCLE = 15 - FIZZ_BUZZ_PER_CYCLE
const BYTES_PER_CYCLE = (FIZZ_BUZZ_PER_CYCLE * 4) + (INT_PER_CYCLE * MAX_DECIMALS) // 4 bytes each for 'Fizz' and 'Buzz'
const CYCLES = Math.floor(BUFFER_SIZE / BYTES_PER_CYCLE)

function fizzbuzz() {
    const buffer = Buffer.alloc(BYTES_PER_CYCLE * CYCLES)
    let offset = 0

    const writeFizz = () => {
        buffer.writeUInt32BE(0x46697a7a, offset)
        offset += 4
        buffer.writeUint8(0x0a, offset)
        offset += 1
    }

    const writeBuzz = () => {
        buffer.writeUInt32BE(0x42757a7a, offset)
        offset += 4
        buffer.writeUint8(0x0a, offset)
        offset += 1
    }

    const writeFizzBuzz = () => {
        buffer.writeUInt32BE(0x46697a7a, offset)
        offset += 4
        buffer.writeUInt32BE(0x42757a7a, offset)
        offset += 4
        buffer.writeUint8(0x0a, offset)
        offset += 1
    }

    // Works between 1 to 2^63-1
    const writeBigInt = (n) => {
        let hasLeading = false

        for (let exp = MAX_DECIMALS; exp >= 0; exp--) {
            const divisor = BASE10[exp]

            if (n >= divisor) {
                const digit = n / divisor
                n = n % divisor

                buffer.writeUint8(0x30 + Number(digit), offset)
                offset += 1
                hasLeading = true
            } else if (hasLeading) {
                buffer.writeUint8(0x30, offset)
                offset += 1
            }
        }

        buffer.writeUint8(0x0a, offset)
        offset += 1
    }

    const printAndResetBuffer = () => {
        process.stdout.write(buffer.subarray(0, offset), 'ascii')
        offset = 0
    }

    for (let i = 0n, cycles = 0; i < MAX; i += 15n, cycles += 1) {
        writeBigInt(i + 1n)
        writeBigInt(i + 2n)
        writeFizz()
        writeBigInt(i + 4n)
        writeBuzz()
        writeFizz()
        writeBigInt(i + 7n)
        writeBigInt(i + 8n)
        writeFizz()
        writeBuzz()
        writeBigInt(i + 11n)
        writeFizz()
        writeBigInt(i + 13n)
        writeBigInt(i + 14n)
        writeFizzBuzz()

        if (cycles >= CYCLES) {
            printAndResetBuffer()
            cycles = 0
        }
    }

    printAndResetBuffer()
}

fizzbuzz()

This runs at an astonishing ~8MB/s. Replacing all the BigInts with native numbers runs at ~50MB/s.

I can't believe this was the fastest solution I could come up with that doesn't eventually hit OOM errors.

Worker threads

const { Worker, isMainThread, parentPort, workerData } = require('worker_threads')

const MAX = 0x7FFFFFFFFFFFFFFFn
const BUFFER_SIZE = 64 * 1024

const MAX_DECIMALS = 19 // 19 max bytes (decimal places) for 2^63-1
const BASE10 = [...Array(MAX_DECIMALS + 1).keys()].map((i) => 10n ** BigInt(i))

const FIZZ_BUZZ_PER_CYCLE = (15 / 3) + (15 / 5)
const INT_PER_CYCLE = 15 - FIZZ_BUZZ_PER_CYCLE
const BYTES_PER_CYCLE = (FIZZ_BUZZ_PER_CYCLE * 4) + (INT_PER_CYCLE * MAX_DECIMALS) // 4 bytes each for 'Fizz' and 'Buzz'
const CYCLES = Math.floor(BUFFER_SIZE / BYTES_PER_CYCLE)
const INTS_PER_CYCLE = BigInt(CYCLES) * 15n

const THREADS = 12

async function fizzbuzz() {
    if (isMainThread) {
        const sharedStrBuffer = new SharedArrayBuffer(THREADS * BUFFER_SIZE)
        const sharedCanConsumeBuffer = new SharedArrayBuffer(THREADS * Int32Array.BYTES_PER_ELEMENT)
        const sharedCanProduceBuffer = new SharedArrayBuffer(THREADS * Int32Array.BYTES_PER_ELEMENT)

        const strBuffer = new Uint8Array(sharedStrBuffer)
        const canConsume = new Int32Array(sharedCanConsumeBuffer) // when non-zero, it stores the position of the last character
        const canProduce = new Int32Array(sharedCanProduceBuffer) // when non-zero, worker thread can work

        const workers = []
        for (let threadId = 0; threadId < THREADS; threadId++) {
            canConsume[threadId] = 0
            canProduce[threadId] = 1

            const worker = new Worker(__filename, {
                workerData: {
                    threadId,
                    sharedStrBuffer,
                    sharedCanConsumeBuffer,
                    sharedCanProduceBuffer,
                }
            })

            workers.push(worker)
        }

        const step = INTS_PER_CYCLE * BigInt(THREADS)
        for (let i = 0n; i < MAX; i += step) {
            for (let threadId = 0; threadId < THREADS; threadId++) {
                Atomics.wait(canConsume, threadId, 0)

                const offsetStart = threadId * BUFFER_SIZE
                const offsetEnd = Atomics.load(canConsume, threadId)
                process.stdout.write(strBuffer.subarray(offsetStart, offsetEnd), 'ascii')

                Atomics.store(canProduce, threadId, 1)
                Atomics.notify(canProduce, threadId)
            }
        }
    } else {
        const { threadId } = workerData

        const strBuffer = new Uint8Array(workerData.sharedStrBuffer)
        const canConsume = new Int32Array(workerData.sharedCanConsumeBuffer)
        const canProduce = new Int32Array(workerData.sharedCanProduceBuffer)

        const initOffset = threadId * BUFFER_SIZE
        let offset = initOffset

        const writeFizz = () => {
            strBuffer[offset + 0] = 0x46
            strBuffer[offset + 1] = 0x69
            strBuffer[offset + 2] = 0x7a
            strBuffer[offset + 3] = 0x7a

            strBuffer[offset + 4] = 0x0a
            offset += 5
        }

        const writeBuzz = () => {
            strBuffer[offset + 0] = 0x42
            strBuffer[offset + 1] = 0x75
            strBuffer[offset + 2] = 0x7a
            strBuffer[offset + 3] = 0x7a

            strBuffer[offset + 4] = 0x0a
            offset += 5
        }

        const writeFizzBuzz = () => {
            strBuffer[offset + 0] = 0x46
            strBuffer[offset + 1] = 0x69
            strBuffer[offset + 2] = 0x7a
            strBuffer[offset + 3] = 0x7a

            strBuffer[offset + 4] = 0x42
            strBuffer[offset + 5] = 0x75
            strBuffer[offset + 6] = 0x7a
            strBuffer[offset + 7] = 0x7a

            strBuffer[offset + 8] = 0x0a
            offset += 9
        }

        // Works between 1 to 2^63-1
        const writeBigInt = (n) => {
            let hasLeading = false

            for (let exp = MAX_DECIMALS; exp >= 0; exp--) {
                const divisor = BASE10[exp]

                if (n >= divisor) {
                    const digit = n / divisor
                    n = n % divisor

                    strBuffer[offset] = 0x30 + Number(digit)
                    offset += 1
                    hasLeading = true
                } else if (hasLeading) {
                    strBuffer[offset] = 0x30
                    offset += 1
                }
            }

            strBuffer[offset] = 0x0a
            offset += 1
        }

        const intsPerGlobalCycle = INTS_PER_CYCLE * BigInt(THREADS)
        const totalCycles = (MAX / intsPerGlobalCycle) + 1n

        for (let c = 0n; c < totalCycles; c++) {
            Atomics.wait(canProduce, threadId, 0)
    
            const startInt = (c * intsPerGlobalCycle) + (BigInt(threadId) * INTS_PER_CYCLE)
            const endInt = (startInt + INTS_PER_CYCLE > MAX)
                ? MAX
                : startInt + INTS_PER_CYCLE

            for (let i = startInt; i < endInt; i += 15n) {
                writeBigInt(i + 1n)
                writeBigInt(i + 2n)
                writeFizz()
                writeBigInt(i + 4n)
                writeBuzz()
                writeFizz()
                writeBigInt(i + 7n)
                writeBigInt(i + 8n)
                writeFizz()
                writeBuzz()
                writeBigInt(i + 11n)
                writeFizz()
                writeBigInt(i + 13n)
                writeBigInt(i + 14n)
                writeFizzBuzz()
            }

            Atomics.store(canConsume, threadId, offset)
            Atomics.notify(canConsume, threadId)
            offset = initOffset

            Atomics.store(canProduce, threadId, 0)
        }
    }
}

fizzbuzz()

After running a profiler on my single threaded solution, the biggest slowdown was these 2 lines:

const digit = n / divisor
n = n % divisor

That makes sense, considering it's a division operation on BigInts. I've also thought of using WebAssembly or a C module to access native 64-bit integers, but I think that defeats the purpose of this exercise, as I want to use pure JavaScript.
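The writeBigInt() digit loop is essentially repeated division by descending powers of ten; the same scheme sketched in Python (`digits` is a hypothetical helper name):

```python
MAX_DECIMALS = 19  # max decimal places for 2^63-1, as in the JS code
BASE10 = [10 ** e for e in range(MAX_DECIMALS + 1)]

def digits(n: int) -> str:
    # Walk powers of ten from most significant down, dividing at each step
    # and suppressing leading zeros -- the work writeBigInt() does per line.
    out = []
    for exp in range(MAX_DECIMALS, -1, -1):
        d, n = divmod(n, BASE10[exp])
        if d or out:
            out.append(chr(0x30 + d))
    return "".join(out)

assert digits(1234) == "1234"
assert digits(2**63 - 1) == "9223372036854775807"
```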

Pushing this computation onto worker threads allowed this solution to run at ~380 MB/s with 12 threads.

Side note: For some reason, my pv command freezes when testing:

node worker-threads.js | pv > /dev/null
93.7KiB 0:00:10 [0.00 B/s]

Instead, I piped the output to a temp file with timeout 30 node worker-threads.js > tmp.txt and then manually calculated speed = tmp.txt size / 30s. Maybe someone can offer some insight into why this is happening for me.

\$\endgroup\$
1
  • 1
    \$\begingroup\$ Welcome to Code Golf! Surprised Node didn't do better, honestly. Still beats non-PyPy Python by a landslide though \$\endgroup\$ Commented Jun 23, 2022 at 16:09
4
\$\begingroup\$

This Python two-liner (!) gets >500 MiB/s with pypy on a MacBook Pro M1 Max:

$ cat fizzbuzz.py
for i in range (0,1000000000000,15):
   print("%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz\n" 
      % (1+i, 2+i, 4+i, 7+i, 8+i, 11+i, 13+i, 14+i))

$ python3 fizzbuzz.py|pv > /dev/null
3.73GiB 0:00:27 [ 143MiB/s] [                                                                                 

$ pypy fizzbuzz.py|pv > /dev/null
70.3GiB 0:02:17 [ 525MiB/s] [

Pypy throughput is very stable at around 522-525 MiB/s.

I can get an additional 100 MiB/s by working on 30 values at a time, so there's more to gain from further output chunking.
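Rather than hand-writing ever-longer format strings, the string and its number offsets can be generated for any multiple-of-15 step (`make_chunk` is a hypothetical helper, using the capitalisation the challenge requires):

```python
# Hypothetical generator for the chunked format string: returns the format
# string and the 1-based offsets of the plain-number lines within the step.
def make_chunk(step: int):
    assert step % 15 == 0
    parts, offsets = [], []
    for k in range(1, step + 1):
        if k % 15 == 0:
            parts.append("FizzBuzz")
        elif k % 3 == 0:
            parts.append("Fizz")
        elif k % 5 == 0:
            parts.append("Buzz")
        else:
            parts.append("%d")
            offsets.append(k)
    return "\n".join(parts) + "\n", offsets

fmt, offs = make_chunk(15)
assert offs == [1, 2, 4, 7, 8, 11, 13, 14]
assert fmt % tuple(offs) == \
    "1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\nBuzz\n11\nFizz\n13\n14\nFizzBuzz\n"
```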

A somewhat boring solution, but by stepping 60 at a time, this achieves 740 MiB/s with pypy and 180 MiB/s with python3:

import sys
for i in range (0,100000000000,60):
    sys.stdout.write("%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz\n%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz\n%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz\n%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz\n"
        % ( 1+i,  2+i,  4+i,  7+i,  8+i, 11+i, 13+i, 14+i,
           16+i, 17+i, 19+i, 22+i, 23+i, 26+i, 28+i, 29+i,
           31+i, 32+i, 34+i, 37+i, 38+i, 41+i, 43+i, 44+i,
           46+i, 47+i, 49+i, 52+i, 53+i, 56+i, 58+i, 59+i ))


$ python3  fizzbuzz.py|pv > /dev/null
18.1GiB 0:01:40 [ 185MiB/s] 

$ pypy  fizzbuzz.py|pv > /dev/null
30.1GiB 0:00:42 [ 739MiB/s] [
\$\endgroup\$
2
  • \$\begingroup\$ Not following what is missing .. what do you mean by bytecount ? \$\endgroup\$
    – janfrode
    Commented Jul 3, 2023 at 19:08
  • \$\begingroup\$ Sorry, that's my bad. I didn't read what the challenge was. Nice first answer! \$\endgroup\$
    – The Thonnu
    Commented Jul 3, 2023 at 19:17
4
\$\begingroup\$

Here's my answer in Elixir: https://github.com/technusm1/fizzbuzz

It's a full-fledged mix project, so I hosted it on GitHub. On my system, a MacBook Pro 16-inch 2019 (2.6 GHz 6-Core Intel Core i7), I get the following throughput:

  • OP's naive C implementation: 68 MiB/s
  • My naive elixir implementation: 2 MiB/s
  • My concurrent elixir implementation: 225 MiB/s (breached 200 MiB/s, yay!)

My latest optimization:

  • Using a binary as my GenServer state instead of an iolist in Fizzbuzz.Worker. This reduces message-passing overhead when calling IO.binwrite to issue the print command. Here's the code:
defmodule Fizzbuzz do
  def fizzbuzz_no_io(enumerable) do
    Stream.map(enumerable, &reply/1)
    |> Stream.chunk_every(5000)
    |> Enum.into([])
  end

  def reply(n) when rem(n, 15) == 0, do: <<70, 105, 122, 122, 66, 117, 122, 122, 10>>
  def reply(n) when rem(n, 3) == 0, do: <<70, 105, 122, 122, 10>>
  def reply(n) when rem(n, 5) == 0, do: <<66, 117, 122, 122, 10>>
  def reply(n), do: [Integer.to_string(n), <<10>>]
end

defmodule Fizzbuzz.Cli do
  def main([lower, upper]) do
    {lower, upper} = {String.to_integer(lower), String.to_integer(upper)}
    chunk_size = min(div(upper - lower, System.schedulers_online()), 6000)

    if chunk_size == 6000 do
      # We'll divide the input range into 3 parts: beginning, 6k ranges and ending
      # beginning and ending will be processed before and after stream.run respectively.
      input_lower =
        case rem(lower, 15) do
          1 ->
            lower

          0 ->
            IO.binwrite("FizzBuzz\n")
            lower + 1

          remainder ->
            IO.binwrite(Fizzbuzz.fizzbuzz_no_io(lower..(15 - remainder + lower)))
            15 - remainder + lower + 1
        end

      input_upper =
        case rem(upper - input_lower + 1, 6000) do
          0 -> upper
          remainder -> upper - remainder
        end

      input_enumerable = Chunk6kStream.create(input_lower..input_upper)

      Task.async_stream(
        input_enumerable,
        fn input -> elem(GenServer.start_link(Fizzbuzz.Worker, [input]), 1) end,
        timeout: :infinity
      )
      |> Stream.map(fn {:ok, res} -> res end)
      |> Stream.each(fn pid ->
        GenServer.call(pid, :print)
        Process.exit(pid, :kill)
      end)
      |> Stream.run()

      if input_upper < upper do
        IO.binwrite(Fizzbuzz.fizzbuzz_no_io(input_upper+1..upper))
      end
    else
      input_enumerable = get_input_ranges2(lower, upper, chunk_size)

      Task.async_stream(
        input_enumerable,
        fn input -> elem(GenServer.start_link(Fizzbuzz.Worker, [input]), 1) end,
        timeout: :infinity
      )
      |> Stream.map(fn {:ok, res} -> res end)
      |> Stream.each(fn pid ->
        GenServer.call(pid, :print)
        Process.exit(pid, :kill)
      end)
      |> Stream.run()
    end
  end

  def main(_), do: IO.puts("Usage: fizzbuzz 1 10000")

  defp get_input_ranges2(lower, upper, chunk_size) do
    # Need to make this streamable
    if chunk_size >= 10 do
      ChunkRangeStream.create(lower..upper, chunk_size)
    else
      [lower..upper]
    end
  end
end

defmodule Chunk6kStream do
  # Make sure that range has size of multiples of 6000 and range.first is divisible by 15
  def create(range) do
    Stream.resource(fn -> initialize(range) end, &generate_next_value/1, &done/1)
  end

  defp initialize(range) do
    {range, range.first, range.last}
  end

  defp generate_next_value({range, lower, upper}) when lower == upper + 1 do
    {:halt, {range, upper, upper}}
  end

  defp generate_next_value({range, lower, upper}) do
    {[lower..(lower + 5999)], {range, lower + 6000, upper}}
  end

  defp done(_) do
    nil
  end
end

defmodule ChunkRangeStream do
  def create(range, chunk_size) do
    Stream.resource(fn -> initialize(range, chunk_size) end, &generate_next_value/1, &done/1)
  end

  defp initialize(range, chunk_size) do
    {range, chunk_size, range.first}
  end

  defp generate_next_value({range, chunk_size, lower}) do
    if lower < range.last do
      {[lower..min(lower + chunk_size, range.last)],
       {range, chunk_size, min(range.last, lower + chunk_size + 1)}}
    else
      {:halt, {range, chunk_size, lower}}
    end
  end

  defp done(_) do
    nil
  end
end

defmodule Fizzbuzz.Worker do
  use GenServer

  def init([range]) do
    send(self(), {:calculate, range})
    {:ok, []}
  end

  def handle_info({:calculate, range}, _state) do
    res = if Range.size(range) == 6000 do
      i = range.first - 1
      0..(400-1)
      |> Stream.map(fn j ->
        [[Integer.to_string(15 * j + 1 + i), "\n"], [Integer.to_string(15 * j + 2 + i), "\n"], "Fizz\n", [Integer.to_string(15 * j + 4 + i), "\n"], "Buzz\nFizz\n", [Integer.to_string(15 * j + 7 + i), "\n"], [Integer.to_string(15 * j + 8 + i), "\n"], "Fizz\nBuzz\n", [Integer.to_string(15 * j + 11 + i), "\n"], "Fizz\n", [Integer.to_string(15 * j + 13 + i), "\n"], [Integer.to_string(15 * j + 14 + i), "\n"], "FizzBuzz\n"]
      end)
      |> Stream.chunk_every(400)
      |> Enum.into([])
    else
      Fizzbuzz.fizzbuzz_no_io(range)
    end

    {:noreply, res |> :erlang.iolist_to_binary}
  end

  def handle_call(:print, _from, results) do
    IO.binwrite(results)
    {:reply, :ok, []}
  end
end
\$\endgroup\$
3
  • \$\begingroup\$ Welcome to Code Golf, and nice answer! \$\endgroup\$ Commented Jul 8, 2023 at 14:48
  • 1
    \$\begingroup\$ Please include the code in the question. Links can go dead, or the content can change, and future readers need to be able to see the exact code used. \$\endgroup\$ Commented Jul 10, 2023 at 9:46
  • \$\begingroup\$ That's a valid point, done :) \$\endgroup\$
    – technusm1
    Commented Jul 10, 2023 at 13:40
3
\$\begingroup\$

I wrote two versions in C# 10: one using "vanilla" language features and one based on Isaac's implementation.

In the optimized version, I exploited the fact that an exact length memcpy is not necessary if the buffer is large enough. It's limited to 10^16 iterations, and the first 10 results have a leading 0 (I just thought it would be worth posting it anyway).
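The overshooting-copy trick can be sketched in Python (a hypothetical helper, not the C# code itself): always copy a fixed-size chunk, but advance the cursor only by the payload length, so the garbage tail is overwritten by the next write. The output buffer just needs a little slack past the useful end.

```python
def emit_padded(buf: bytearray, pos: int, chunk16: bytes, real_len: int) -> int:
    # Always copy the full fixed-size 16-byte chunk (a cheap, branch-free
    # width), but advance the cursor only by the payload length; the
    # surplus bytes are garbage that the next emit overwrites.
    buf[pos:pos + 16] = chunk16
    return pos + real_len

buf = bytearray(64)  # leave >=16 bytes of slack beyond the useful output
pos = emit_padded(buf, 0, b"Fizz\n".ljust(16, b"\x00"), 5)
pos = emit_padded(buf, pos, b"Buzz\n".ljust(16, b"\x00"), 5)
# buf[:pos] now holds b"Fizz\nBuzz\n"
```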

Results:

OS: Ubuntu 20.04.3 LTS
CPU: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
(note: this is a GitHub Actions worker)

Author              Lang             Avg. Speed
-----------------------------------------------
Paolo Bonzini       C               11.38 GiB/s
Isaac G.            C                1.90 GiB/s
Neil                C                1.83 GiB/s
Kamila Szewczyk     C                1.44 GiB/s
Daniel              C# (opt)      1006.00 MiB/s
Olivier Grégoire    Java           242.62 MiB/s
Daniel              C# (simpl)    125.125 MiB/s

Simple version

using System.Text;

var sb = new StringBuilder();
ulong c = 0;

while (true) {
    while (sb.Length < 1024 * 1024 * 4) {
        sb.Append($"{c + 1}\n{c + 2}\nFizz\n{c + 4}\nBuzz\nFizz\n{c + 7}\n{c + 8}\nFizz\nBuzz\n{c + 11}\nFizz\n{c + 13}\n{c + 14}\nFizzBuzz\n");
        c += 15;
    }
    Console.Out.Write(sb);
    sb.Clear();
}

"Optimized" version

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

using var stdout = Console.OpenStandardOutput();

var buf = new byte[1024 * 1024 * 4];
var counter = new Counter();

while (true) {
    int pos = 0;
    for (int i = 0; i < 4096; i++) {
        pos = Fill30(buf, pos, ref counter);
    }
    stdout.Write(buf.AsSpan(0, pos));
}

int Fill30(byte[] buf, int pos, ref Counter ctr)
{
    var Fizz_ = 0x0A_7A_7A_69_46ul;
    var Fizz_Buzz_ = Vector128.Create(0x7A_75_42_0A_7A_7A_69_46ul, 0x0A_7A).AsByte();
    var Buzz_Fizz_ = Vector128.Create(0x7A_69_46_0A_7A_7A_75_42ul, 0x0A_7A).AsByte();
    var FizzBuzz__ = Vector128.Create(0x7A_7A_75_42_7A_7A_69_46, 0x0Aul).AsByte();

    //0..9
    var prefix = ctr.Get();
    int prefixLen = ctr.NumDigits;
    ctr.Inc();

    AddCtr(1);
    AddCtr(2);
    Add(Fizz_, 5);
    AddCtr(4);
    Add(Buzz_Fizz_, 10);
    AddCtr(7);
    AddCtr(8);
    Add(Fizz_Buzz_, 10);
    //10..19
    prefix = ctr.Get();
    prefixLen = ctr.NumDigits;
    ctr.Inc();
    
    AddCtr(1);
    Add(Fizz_, 5);
    AddCtr(3);
    AddCtr(4);
    Add(FizzBuzz__, 9);
    AddCtr(6);
    AddCtr(7);
    Add(Fizz_, 5);
    AddCtr(9);

    //20..29
    prefix = ctr.Get();
    prefixLen = ctr.NumDigits;
    ctr.Inc();
    
    Add(Buzz_Fizz_, 10);
    AddCtr(2);
    AddCtr(3);
    Add(Fizz_Buzz_, 10);
    AddCtr(6);
    Add(Fizz_, 5);
    AddCtr(8);
    AddCtr(9);
    Add(FizzBuzz__, 9);
    return pos;

    void Add<T>(T val, int len)
    {
        //ref byte ptr = ref Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(buf), pos);
        //Unsafe.WriteUnaligned(ref ptr, val);
        Unsafe.WriteUnaligned(ref buf[pos], val);
        pos += len;
    }
    void AddCtr(int digit)
    {
        Add(prefix, prefixLen);
        Add((short)(0x0A_30 + digit), 2); // "<digit>\n" ascii in little endian
    }
}

unsafe struct Counter
{
    public const int MAX_DIGITS = 16;
    public fixed byte Digits[MAX_DIGITS * 2];
    public int NumDigits;

    public Counter()
    {
        Unsafe.InitBlock(ref Digits[0], (byte)'0', MAX_DIGITS);
        NumDigits = 1;
    }
    public void Inc()
    {
        int i = MAX_DIGITS - 1;
        for (; i >= 0; i--) {
            if (Digits[i] != (byte)'9') {
                Digits[i]++;
                break;
            }
            Digits[i] = (byte)'0';
        }
        NumDigits = Math.Max(NumDigits, MAX_DIGITS - i);
    }
    public Vector128<byte> Get()
    {
        ref byte ptr = ref Digits[MAX_DIGITS - NumDigits];
        return Unsafe.ReadUnaligned<Vector128<byte>>(ref ptr);
    }
}
\$\endgroup\$
3
\$\begingroup\$

I wrote an answer in Go and got 2.35 GiB/s on my Dell 5560, but please test on your own system. 12 threads (goroutines, in Go terms) was best on this 8-core, 2-threads-per-core CPU, but please try other counts too, as the optimum varies. Thanks for the interesting challenge!

package main

import (
    "os"
    "strconv"
    "sync"
    "unsafe"
)

const buffSize = 200000000
const innerloop = 25000

const routines = 12

func main() {
    eightZeroLock := &sync.Mutex{}
    startLock := eightZeroLock
    endLock := &sync.Mutex{}
    endLock.Lock()
    for i := 0; i < routines-1; i++ {
        go routine(i, startLock, endLock)
        startLock = endLock
        endLock = &sync.Mutex{}
        endLock.Lock()
    }
    go routine(routines-1, startLock, eightZeroLock)
    wg := sync.WaitGroup{}
    wg.Add(1)
    wg.Wait()
}

func routine(num int, start, end *sync.Mutex) {
    counter := num * 15 * innerloop

    var sb Builder
    sb.Grow(buffSize)
    for {

        for i := 0; i < innerloop; i++ {
            sb.WriteString(strconv.Itoa(counter + 1))
            sb.WriteString("\n")
            sb.WriteString(strconv.Itoa(counter + 2))
            sb.WriteString("\nFizz\n")
            sb.WriteString(strconv.Itoa(counter + 4))
            sb.WriteString("\nBuzz\nFizz\n")
            sb.WriteString(strconv.Itoa(counter + 7))
            sb.WriteString("\n")
            sb.WriteString(strconv.Itoa(counter + 8))
            sb.WriteString("\nFizz\nBuzz\n")
            sb.WriteString(strconv.Itoa(counter + 11))
            sb.WriteString("\nFizz\n")
            sb.WriteString(strconv.Itoa(counter + 13))
            sb.WriteString("\n")
            sb.WriteString(strconv.Itoa(counter + 14))
            sb.WriteString("\nFizzBuzz\n")

            counter += 15
        }
        start.Lock()
        os.Stdout.WriteString((sb.String()))
        end.Unlock()
        sb.buf = sb.buf[:0]
        counter += 15 * (routines - 1) * innerloop

    }
}

// Everything below is copied from Go's strings.Builder (some lines removed)
// Copyright 2017 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// A Builder is used to efficiently build a string using Write methods.
// It minimizes memory copying. The zero value is ready to use.
// Do not copy a non-zero Builder.
type Builder struct {
    addr *Builder // of receiver, to detect copies by value
    buf  []byte
}

// noescape hides a pointer from escape analysis.  noescape is
// the identity function but escape analysis doesn't think the
// output depends on the input. noescape is inlined and currently
// compiles down to zero instructions.
// USE CAREFULLY!
// This was copied from the runtime; see issues 23382 and 7921.
//go:nosplit
//go:nocheckptr
func noescape(p unsafe.Pointer) unsafe.Pointer {
    x := uintptr(p)
    return unsafe.Pointer(x ^ 0)
}

func (b *Builder) copyCheck() {
    if b.addr == nil {
        // This hack works around a failing of Go's escape analysis
        // that was causing b to escape and be heap allocated.
        // See issue 23382.
        // TODO: once issue 7921 is fixed, this should be reverted to
        // just "b.addr = b".
        b.addr = (*Builder)(noescape(unsafe.Pointer(b)))
    } else if b.addr != b {
        panic("strings: illegal use of non-zero Builder copied by value")
    }
}

// String returns the accumulated string.
func (b *Builder) String() string {
    return *(*string)(unsafe.Pointer(&b.buf))
}

// grow copies the buffer to a new, larger buffer so that there are at least n
// bytes of capacity beyond len(b.buf).
func (b *Builder) grow(n int) {
    buf := make([]byte, len(b.buf), 2*cap(b.buf)+n)
    copy(buf, b.buf)
    b.buf = buf
}

// Grow grows b's capacity, if necessary, to guarantee space for
// another n bytes. After Grow(n), at least n bytes can be written to b
// without another allocation. If n is negative, Grow panics.
func (b *Builder) Grow(n int) {
    if n < 0 {
        panic("strings.Builder.Grow: negative count")
    }
    if cap(b.buf)-len(b.buf) < n {
        b.grow(n)
    }
}

// WriteString appends the contents of s to b's buffer.
// It returns the length of s and a nil error.
func (b *Builder) WriteString(s string) (int, error) {
    b.buf = append(b.buf, s...)
    return len(s), nil
}

go run main.go | pv -a >/dev/null

Github: https://github.com/Bysmyyr/fizzbuzz
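The baton-passing mutex pattern above can be sketched with Python threads (a hypothetical miniature, not the Go code): each worker formats its blocks privately, then waits on its start lock, writes, and releases the next worker's lock, so the output stays in order while formatting runs in parallel.

```python
import threading

def block(i):
    # One 15-line FizzBuzz block starting at i+1 (i divisible by 15).
    return (f"{i+1}\n{i+2}\nFizz\n{i+4}\nBuzz\nFizz\n{i+7}\n{i+8}\n"
            f"Fizz\nBuzz\n{i+11}\nFizz\n{i+13}\n{i+14}\nFizzBuzz\n")

def worker(idx, start, end, out, rounds, n):
    counter = idx * 15
    for _ in range(rounds):
        chunk = block(counter)  # format while other workers are writing
        start.acquire()         # wait for our turn
        out.append(chunk)       # stands in for os.Stdout.WriteString
        end.release()           # pass the baton to the next worker
        counter += 15 * n

n, rounds = 3, 2
locks = [threading.Lock() for _ in range(n)]
for lk in locks[1:]:
    lk.acquire()                # only worker 0 may start; locks form a ring
out = []
ts = [threading.Thread(target=worker,
                       args=(i, locks[i], locks[(i + 1) % n], out, rounds, n))
      for i in range(n)]
for t in ts: t.start()
for t in ts: t.join()
# "".join(out) is FizzBuzz for 1..90, in order
```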

\$\endgroup\$
2
  • 1
    \$\begingroup\$ With 1-8 goroutines throughput seem to be in correlation with the number of goroutines (more routines more throughput). But with 9-50 it's all the same and gets saturated at around 3.8GiB/s. More than ~50 goroutines and it starts to slow down. Surprised it took so long for anyone to post the first golang submission \$\endgroup\$ Commented Jun 7, 2022 at 21:19
  • \$\begingroup\$ hmm.. why this doesn't give 7.33GB output '__') \$\endgroup\$
    – Kokizzu
    Commented Oct 18, 2022 at 7:53
3
+100
\$\begingroup\$

Updated code: this version (no more strings) gets me to 4.8 GiB/s. I can't break the 5 GiB/s barrier on the M1 with this fairly naive implementation :-(. It also uses larger segments and even more memory. Still, it has better throughput than just writing empty arrays to stdout (even when draining through the direct byte buffer).

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedList;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.stream.IntStream;

public class FizzBuzz {

    static final String NEWLINE = String.format("%n");

    static byte[] fizzbuzz = "fizzbuzz\n".getBytes(StandardCharsets.ISO_8859_1);
    static byte[] fizz = "fizz\n".getBytes(StandardCharsets.ISO_8859_1);
    static byte[] buzz = "buzz\n".getBytes(StandardCharsets.ISO_8859_1);
    static byte newline = "\n".getBytes(StandardCharsets.ISO_8859_1)[0];
    static byte zero = (byte) '0';

    static int fizzBuzz(long src, byte[] destination, int destinationPos, byte[] digitsHelper, int digitsHelperLast) {
        if (src % 15 == 0) {
            System.arraycopy(fizzbuzz, 0, destination, destinationPos, fizzbuzz.length);
            return destinationPos + fizzbuzz.length;
        } else if (src % 5 == 0) {
            System.arraycopy(buzz, 0, destination, destinationPos, buzz.length);
            return destinationPos + buzz.length;
        } else if (src % 3 == 0) {
            System.arraycopy(fizz, 0, destination, destinationPos, fizz.length);
            return destinationPos + fizz.length;
        } else {
            do {
                digitsHelper[digitsHelperLast--] = (byte) (zero + src % 10);
                src /= 10;
            } while (src != 0);
            int len = digitsHelper.length - digitsHelperLast;
            System.arraycopy(digitsHelper, digitsHelperLast, destination, destinationPos, len);
            return destinationPos + len;
        }

    }

    record Pair(byte[] item1, ByteBuffer item2) {
    }


    public static void main(String[] argv) throws Exception {

        final var channel = new FileOutputStream(FileDescriptor.out).getChannel();

        final Consumer<ByteBuffer> writeToChannel = bb -> {
            try {
                while (bb.hasRemaining()) {
                    channel.write(bb);
                }
            } catch (Exception ex) {
                ex.printStackTrace();
                System.exit(1);
            }
        };

        final var segment = 10_000_000;
        final var arralloc = 20 * segment;

        final Queue<CompletableFuture<Pair>> queue =
                new LinkedList<>
                        (IntStream.range(0, Optional.of(ForkJoinPool.getCommonPoolParallelism() - 1).filter(i -> i > 0).orElse(1))
                                .mapToObj(i ->
                                        CompletableFuture.completedFuture(new Pair(new byte[arralloc], ByteBuffer.allocateDirect(arralloc)))).toList());

        CompletableFuture<?> last = CompletableFuture.completedFuture(null);

        final Supplier<Pair> supplier = () -> {
            try {
                return queue.poll().get();
            } catch (Exception ex) {
                ex.printStackTrace();
                System.exit(1);
                return null;
            }
        };


        var nr = 0L;


        while (true) {
            final var start = nr;
            final var end = nr + segment;


            final var finalLast = last;

            var cf = CompletableFuture.completedFuture(supplier.get())
                    .thenApplyAsync(p -> {
                        var arr = p.item1();
                        var bb = p.item2();
                        var pos = 0;
                        var digitsHelper = new byte[20];
                        digitsHelper[19] = newline;

                        for (long l = start; l < end; l++) {
                            pos = fizzBuzz(l, arr, pos, digitsHelper, 18);
                        }
                        bb.clear();
                        bb.put(arr, 0, pos);
                        bb.flip();
                        return p;
                    })
                    .thenCombineAsync(finalLast, (p, v) -> {
                        try {
                            channel.write(p.item2());
                        } catch (Exception ex) {
                            ex.printStackTrace();
                            System.exit(1);
                        }
                        return p;
                    });

            queue.add(cf);
            last = cf;

            nr = end;


        }

    }

}
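The backward digit rendering in fizzBuzz above (filling digitsHelper from the right, then copying once) can be sketched in Python; the names here are illustrative, not part of the Java code:

```python
ZERO = ord('0')

def write_number(n, dest, pos, scratch):
    # Fill scratch right-to-left with n's ASCII digits; scratch's last
    # byte is assumed to already hold b'\n'. Then copy digits + newline
    # into dest with a single slice assignment and return the new cursor.
    i = len(scratch) - 2
    while True:
        scratch[i] = ZERO + n % 10
        n //= 10
        if n == 0:
            break
        i -= 1
    chunk = scratch[i:]
    dest[pos:pos + len(chunk)] = chunk
    return pos + len(chunk)

dest = bytearray(32)
scratch = bytearray(21)   # 20 digits max for a long, plus the newline
scratch[-1] = ord('\n')
pos = write_number(123, dest, 0, scratch)
# dest[:pos] == b"123\n"
```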

Regular Java, no tricks on fizzbuzz, needs quite a bit of memory and benefits from multiple threads. On my M1 Pro w/ Java 17 I am at 4.3GiB/s (9 threads + 1 main). ByteBuffer writes max out the channel at ~ 5GiB/s piped via pv (without fizzbuzz).

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedList;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.stream.IntStream;

public class FizzBuzz {

    static final String NEWLINE = "\n"; // the challenge requires a bare \n, not the platform separator

    static StringBuilder fizzBuzz(long src, StringBuilder destination) {
        if (src % 15 == 0) {
            destination.append("fizzbuzz");
        } else if (src % 5 == 0) {
            destination.append("buzz");
        } else if (src % 3 == 0) {
            destination.append("fizz");
        } else {
            destination.append(src);
        }
        return destination.append(NEWLINE);
    }

    record Pair<T1, T2>(T1 item1, T2 item2) {
    }

    public static void main(String[] argv) throws Exception {

        final var channel = new FileOutputStream(FileDescriptor.out).getChannel();

        final Consumer<ByteBuffer> writeToChannel = bb -> {
            try {
                while (bb.hasRemaining()) {
                    channel.write(bb);
                }
            } catch (Exception ex) {
                ex.printStackTrace();
                System.exit(1);
            }
        };

        final Queue<CompletableFuture<Pair<StringBuilder, ByteBuffer>>> queue =
                new LinkedList<>
                        (IntStream.range(0, Optional.of(ForkJoinPool.getCommonPoolParallelism() - 1).filter(i -> i > 0).orElse(1))
                                .mapToObj(i ->
                                        CompletableFuture.completedFuture(new Pair<>(new StringBuilder(20 * 1024 * 1024), ByteBuffer.allocateDirect(41 * 1024 * 1024)))).toList());

        CompletableFuture<?> last = CompletableFuture.completedFuture(null);

        final Supplier<Pair<StringBuilder, ByteBuffer>> supplier = () -> {
            try {
                return queue.poll().get();
            } catch (Exception ex) {
                ex.printStackTrace();
                System.exit(1);
                return null;
            }
        };


        var nr = 0L;
        var segment = 1_000_000L;

        while (true) {
            final var start = nr;
            final var end = nr + segment;


            final var finalLast = last;

            var cf = CompletableFuture.completedFuture(supplier.get())
                    .thenApplyAsync(p -> {
                        var sb = p.item1();
                        var bb = p.item2();
                        sb.delete(0, sb.length());
                        for (long l = start; l < end; l++) {
                            fizzBuzz(l, sb);
                        }
                        bb.clear();
                        bb.put(sb.toString().getBytes(StandardCharsets.UTF_8));
                        return p;
                    })
                    .thenCombineAsync(finalLast, (p, v) -> {
                        try {
                            channel.write(p.item2().flip());
                        } catch (Exception ex) {
                            ex.printStackTrace();
                            System.exit(1);
                        }
                        return p;
                    });

            queue.add(cf);
            last = cf;

            nr = end;


        }

    }

}
\$\endgroup\$
14
  • \$\begingroup\$ Doesn't even seem to reach 3GiB/s on my machine. I ran it with /usr/lib/jvm/java-17-openjdk-17.0.1.0.12-1.rolling.fc35.x86_64/bin/java FizzBuzz.java | pv > /dev/null - let me know if there's some other way I should be running this \$\endgroup\$ Commented Dec 8, 2021 at 23:22
  • \$\begingroup\$ Probably not. I get the same values (4.3) with both oracle jdk and openjdk so I don't think that that will make a difference for you, either. The thread count is set by Optional.of(ForkJoinPool.getCommonPoolParallelism() - 1).filter(i -> i > 0).orElse(1) - and that can be manually replaced with the number of cores in case it gets it wrong (cores - 1). \$\endgroup\$ Commented Dec 8, 2021 at 23:44
  • \$\begingroup\$ No matter what value I give of(...) I can't seem to go past 2.6GiB/s. Must be some M1 pro magic giving you this throughput \$\endgroup\$ Commented Dec 9, 2021 at 0:05
  • \$\begingroup\$ It probably is the M1. What would you get on pv for this: import java.io.FileDescriptor; import java.io.FileOutputStream; import java.nio.ByteBuffer; public class Test { public static void main(String[] argv) throws Exception { var channel = new FileOutputStream(FileDescriptor.out).getChannel(); var buffer = ByteBuffer.allocateDirect(41 * 1024 * 1024); while(true) { while(buffer.hasRemaining()) { channel.write(buffer); } buffer.clear(); } }} ? Just curios what maxes first. \$\endgroup\$ Commented Dec 9, 2021 at 0:23
  • \$\begingroup\$ Welcome to Code Golf, and nice answer! \$\endgroup\$ Commented Dec 9, 2021 at 4:27
3
\$\begingroup\$

Java 17

Compile and run with java FizzBuzz.java.

The algorithm is correct for numbers up to 2^63-1.

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

import static java.nio.charset.StandardCharsets.US_ASCII;

public final class FizzBuzz {

  private static final int BUFFER_SIZE = 1 << 16;
  private static final long STARTING_NUMBER = 10L;
  private static final long SIZE_FOR_30 = 2 * (17 * 8 + 47);
  private static final long FULL_INCREMENT = BUFFER_SIZE / SIZE_FOR_30 * 30;
  private static final long LIGHT_INCREMENT = FULL_INCREMENT - 30;
  private static final int[] BASE_IDX = {-1, 6, 8, 19, 21, 28, 40, 42, 54, 61, 63, 74, 76, 83, 95, 97};
  private static final int[][] IDX = new int[19][16];

  static {
    IDX[1] = BASE_IDX;
    for (var i = 2; i < IDX.length; i++) {
      for (var j = 0; j < 16; ) {
        IDX[i][j] = IDX[i - 1][j] + ++j;
      }
    }
  }


  public static void main(String[] args) {
    var threads = Runtime.getRuntime().availableProcessors();
    var queue = new LinkedBlockingQueue<Future<ByteBuffer>>(threads);
    var executor = Executors.newFixedThreadPool(threads);
    try (var outputStream = new FileOutputStream(FileDescriptor.out);
         var channel = outputStream.getChannel()) {

      var counter = STARTING_NUMBER;
      for (var i = 0; i < threads; i++) {
        var buffer = ByteBuffer.allocateDirect(BUFFER_SIZE);
        queue.offer(executor.submit(new Task(counter, buffer)));
        counter += FULL_INCREMENT;
      }
      channel.write(ByteBuffer.wrap("1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\nBuzz\n".getBytes(US_ASCII)));
      while (true) {
        var buffer = queue.poll().get();
        channel.write(buffer);
        queue.offer(executor.submit(new Task(counter, buffer)));
        counter += FULL_INCREMENT;
      }
    } catch (Exception e) {
      e.printStackTrace(System.err);
    } finally {
      executor.shutdown();
    }
  }

  record Task(long startingNumber, ByteBuffer buffer) implements Callable<ByteBuffer> {

    @Override
    public ByteBuffer call() {
      buffer.clear();
      var n = startingNumber;
      var nextPowerOf10 = 10L;
      var digitCount = 1;
      for (; nextPowerOf10 <= n; nextPowerOf10 *= 10L) {
        digitCount++;
      }
      var idx = IDX[digitCount];
      var t = n / 10L;
      var s = Long.toString(t);
      var template = (s + "1\nFizz\n" +
        s + "3\n" +
        s + "4\nFizzBuzz\n" +
        s + "6\n" +
        s + "7\nFizz\n" +
        s + "9\nBuzz\nFizz\n" +
        (s = Long.toString(++t)) + "2\n" +
        s + "3\nFizz\nBuzz\n" +
        s + "6\nFizz\n" +
        s + "8\n" +
        s + "9\nFizzBuzz\n" +
        (s = Long.toString(++t)) + "1\n" +
        s + "2\nFizz\n" +
        s + "4\nBuzz\nFizz\n" +
        s + "7\n" +
        s + "8\nFizz\nBuzz\n").getBytes(US_ASCII);
      n += 30;
      buffer.put(template);

      for (var limit = n + LIGHT_INCREMENT; n < limit; n += 30L) {
        if (n == nextPowerOf10) {
          nextPowerOf10 *= 10;
          digitCount++;
          idx = IDX[digitCount];
          t = n / 10L;
          s = Long.toString(t);
          template = (s + "1\nFizz\n" +
            s + "3\n" +
            s + "4\nFizzBuzz\n" +
            s + "6\n" +
            s + "7\nFizz\n" +
            s + "9\nBuzz\nFizz\n" +
            (s = Long.toString(++t)) + "2\n" +
            s + "3\nFizz\nBuzz\n" +
            s + "6\nFizz\n" +
            s + "8\n" +
            s + "9\nFizzBuzz\n" +
            (s = Long.toString(++t)) + "1\n" +
            s + "2\nFizz\n" +
            s + "4\nBuzz\nFizz\n" +
            s + "7\n" +
            s + "8\nFizz\nBuzz\n").getBytes(US_ASCII);
        } else {
          for (var i = 0; i < idx.length; ) {
            var pos = idx[i++];
            template[pos] += 3;
            while (template[pos] > '9') {
              template[pos] -= 10;
              template[--pos]++;
            }
          }
        }
        buffer.put(template);
      }
      return buffer.flip();
    }
  }

}

/*
 * Speed references:
 *   - 270 GiB / min ± 3.95 / min on Macbook Pro 2021
 */
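The carry loop that bumps the template in place (template[pos] += 3 with a ripple carry through the ASCII digits) can be sketched in Python; this assumes, as the template construction guarantees, that there is always a spare digit position to the left:

```python
def add3_ascii(buf: bytearray, pos: int) -> None:
    # Add 3 to the decimal number whose least-significant digit sits at
    # buf[pos], rippling the carry leftward through the ASCII digits.
    buf[pos] += 3
    while buf[pos] > ord('9'):
        buf[pos] -= 10
        pos -= 1
        buf[pos] += 1

n = bytearray(b"0997")
add3_ascii(n, 3)
# n == bytearray(b"1000")
```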

\$\endgroup\$
7
  • 3
    \$\begingroup\$ TODO: learn the vector API to increase speed. \$\endgroup\$ Commented Dec 3, 2021 at 15:00
  • \$\begingroup\$ The warmup is probably either all the threads starting or the JVM pre-JIT-ing code \$\endgroup\$
    – Seggan
    Commented May 13, 2022 at 16:02
  • \$\begingroup\$ @Seggan when I profile, both the JIT and the first outputs (more than the 35 hardcoded characters) are done before the first second while the output speed stays low for the first 3 seconds. \$\endgroup\$ Commented May 13, 2022 at 16:36
  • \$\begingroup\$ Will only have my desktop next month so I'll only re-run your submission then \$\endgroup\$ Commented May 16, 2022 at 7:08
  • 1
    \$\begingroup\$ @OmerTuchfeld Oh, wrong output? That's in the hardcoded part. I'll revert (because perf is less) and fix the output. Thanks for letting me know! \$\endgroup\$ Commented Jun 8, 2022 at 5:43
3
\$\begingroup\$

Python 3

We use f-strings to create blocks of 300 lines, and reduce the number of int-to-str conversions by counting in steps of 100.

def fizz_buzz(write):
    write(
        "1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\nBuzz\n11\nFizz\n13\n14\nFiz"
        "zBuzz\n16\n17\nFizz\n19\nBuzz\nFizz\n22\n23\nFizz\nBuzz\n26\nFizz"
        "\n28\n29\nFizzBuzz\n31\n32\nFizz\n34\nBuzz\nFizz\n37\n38\nFizz\nBu"
        "zz\n41\nFizz\n43\n44\nFizzBuzz\n46\n47\nFizz\n49\nBuzz\nFizz\n52\n"
        "53\nFizz\nBuzz\n56\nFizz\n58\n59\nFizzBuzz\n61\n62\nFizz\n64\nBuzz"
        "\nFizz\n67\n68\nFizz\nBuzz\n71\nFizz\n73\n74\nFizzBuzz\n76\n77\nFi"
        "zz\n79\nBuzz\nFizz\n82\n83\nFizz\nBuzz\n86\nFizz\n88\n89\nFizzBuzz"
        "\n91\n92\nFizz\n94\nBuzz\nFizz\n97\n98\nFizz\nBuzz\n"
    )
    h = 0
    while True:
        h += 1
        c1 = str(h)
        h += 1
        c2 = str(h)
        h += 1
        c3 = str(h)
        write(
            f"{c1}01\nFizz\n{c1}03\n{c1}04\nFizzBuzz\n{c1}06\n{c1}07\nFizz\n{c1}09\nBuzz\n"
            f"Fizz\n{c1}12\n{c1}13\nFizz\nBuzz\n{c1}16\nFizz\n{c1}18\n{c1}19\nFizzBuzz\n"
            f"{c1}21\n{c1}22\nFizz\n{c1}24\nBuzz\nFizz\n{c1}27\n{c1}28\nFizz\nBuzz\n"
            f"{c1}31\nFizz\n{c1}33\n{c1}34\nFizzBuzz\n{c1}36\n{c1}37\nFizz\n{c1}39\nBuzz\n"
            f"Fizz\n{c1}42\n{c1}43\nFizz\nBuzz\n{c1}46\nFizz\n{c1}48\n{c1}49\nFizzBuzz\n"
            f"{c1}51\n{c1}52\nFizz\n{c1}54\nBuzz\nFizz\n{c1}57\n{c1}58\nFizz\nBuzz\n"
            f"{c1}61\nFizz\n{c1}63\n{c1}64\nFizzBuzz\n{c1}66\n{c1}67\nFizz\n{c1}69\nBuzz\n"
            f"Fizz\n{c1}72\n{c1}73\nFizz\nBuzz\n{c1}76\nFizz\n{c1}78\n{c1}79\nFizzBuzz\n"
            f"{c1}81\n{c1}82\nFizz\n{c1}84\nBuzz\nFizz\n{c1}87\n{c1}88\nFizz\nBuzz\n"
            f"{c1}91\nFizz\n{c1}93\n{c1}94\nFizzBuzz\n{c1}96\n{c1}97\nFizz\n{c1}99\nBuzz\n"
            f"Fizz\n{c2}02\n{c2}03\nFizz\nBuzz\n{c2}06\nFizz\n{c2}08\n{c2}09\nFizzBuzz\n"
            f"{c2}11\n{c2}12\nFizz\n{c2}14\nBuzz\nFizz\n{c2}17\n{c2}18\nFizz\nBuzz\n"
            f"{c2}21\nFizz\n{c2}23\n{c2}24\nFizzBuzz\n{c2}26\n{c2}27\nFizz\n{c2}29\nBuzz\n"
            f"Fizz\n{c2}32\n{c2}33\nFizz\nBuzz\n{c2}36\nFizz\n{c2}38\n{c2}39\nFizzBuzz\n"
            f"{c2}41\n{c2}42\nFizz\n{c2}44\nBuzz\nFizz\n{c2}47\n{c2}48\nFizz\nBuzz\n"
            f"{c2}51\nFizz\n{c2}53\n{c2}54\nFizzBuzz\n{c2}56\n{c2}57\nFizz\n{c2}59\nBuzz\n"
            f"Fizz\n{c2}62\n{c2}63\nFizz\nBuzz\n{c2}66\nFizz\n{c2}68\n{c2}69\nFizzBuzz\n"
            f"{c2}71\n{c2}72\nFizz\n{c2}74\nBuzz\nFizz\n{c2}77\n{c2}78\nFizz\nBuzz\n"
            f"{c2}81\nFizz\n{c2}83\n{c2}84\nFizzBuzz\n{c2}86\n{c2}87\nFizz\n{c2}89\nBuzz\n"
            f"Fizz\n{c2}92\n{c2}93\nFizz\nBuzz\n{c2}96\nFizz\n{c2}98\n{c2}99\nFizzBuzz\n"
            f"{c3}01\n{c3}02\nFizz\n{c3}04\nBuzz\nFizz\n{c3}07\n{c3}08\nFizz\nBuzz\n"
            f"{c3}11\nFizz\n{c3}13\n{c3}14\nFizzBuzz\n{c3}16\n{c3}17\nFizz\n{c3}19\nBuzz\n"
            f"Fizz\n{c3}22\n{c3}23\nFizz\nBuzz\n{c3}26\nFizz\n{c3}28\n{c3}29\nFizzBuzz\n"
            f"{c3}31\n{c3}32\nFizz\n{c3}34\nBuzz\nFizz\n{c3}37\n{c3}38\nFizz\nBuzz\n"
            f"{c3}41\nFizz\n{c3}43\n{c3}44\nFizzBuzz\n{c3}46\n{c3}47\nFizz\n{c3}49\nBuzz\n"
            f"Fizz\n{c3}52\n{c3}53\nFizz\nBuzz\n{c3}56\nFizz\n{c3}58\n{c3}59\nFizzBuzz\n"
            f"{c3}61\n{c3}62\nFizz\n{c3}64\nBuzz\nFizz\n{c3}67\n{c3}68\nFizz\nBuzz\n"
            f"{c3}71\nFizz\n{c3}73\n{c3}74\nFizzBuzz\n{c3}76\n{c3}77\nFizz\n{c3}79\nBuzz\n"
            f"Fizz\n{c3}82\n{c3}83\nFizz\nBuzz\n{c3}86\nFizz\n{c3}88\n{c3}89\nFizzBuzz\n"
            f"{c3}91\n{c3}92\nFizz\n{c3}94\nBuzz\nFizz\n{c3}97\n{c3}98\nFizz\nBuzz\n"
        )

if __name__ == "__main__":
    import sys
    fizz_buzz(sys.stdout.write)

On my laptop (i7-1165G7):

  • python (3.11.3): 0.6 GiB/s
  • pypy3 (7.3.12): 1.3 GiB/s
\$\endgroup\$
2
\$\begingroup\$

Just tried the following Kotlin, compiled with Kotlin/Native:

fun main(){
  for (i in 1..1_000_000_000){
    when {
        (i % 3 == 0 && i % 5 == 0) -> println("FizzBuzz")
        i % 3 == 0 -> println("Fizz")
        i % 5 == 0 -> println("Buzz")
        else -> println("$i")
      }
  }
}

2.91MiB/s

Machine specs:

  • Linux 5.14.18
  • i7-8550U
  • 16GB Ram
\$\endgroup\$
1
  • 3
    \$\begingroup\$ Welcome to Code Golf, and nice answer! \$\endgroup\$ Commented Dec 2, 2021 at 14:27
2
\$\begingroup\$

Building a bit upon ksousa's answer, you can make two tweaks while keeping readability:

  • Use f-strings not .format() (because that'd add a function call) in the derailleur() function

  • Use os.write() instead of print(), because print() is slow (not that os.write() is much faster, but it is faster)

It ends up looking like this:

from itertools import cycle, count
from os import write

def derailleur(counter, carousel):
    return f"{carousel if carousel else counter}\n"

def main():
    carousel = cycle([0, 0, "Fizz", 0, "Buzz", "Fizz", 0, 0, "Fizz", "Buzz", 0, "Fizz", 0, 0, "FizzBuzz"])
    counter = count(1)
    f = map(derailleur, counter, carousel)
    while True:
        write(1, "".join([next(f) for _ in range(8192)]).encode("utf-8"))

main()

I also played a bit with extending the range, but didn't get a significant speedup from that.

Under WSL2 on a 7200u w/ 16GiB of RAM I'm getting about a 10% to 20% better throughput, but the readings are super noisy (it starts at 15 MiB/s, jumps to 25, back to 20, etc...).

I don't have a Linux machine at hand to test this under a better environment, but I don't expect these changes to make a significant impact. After a few measurements, either using print() or os.write(), about 95% of the time is spent writing the output, and I haven't done much to address that with my changes.
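One rough way to probe the "95% of the time is spent writing" claim is to time chunk construction separately from the writes. This is my own sketch, not the answer's benchmark harness; note that /dev/null accepts writes far faster than a `pv` pipe does, so the write figure here is only a lower bound on the real output cost.

```python
import os
import time

def fizzbuzz_chunk(start, n):
    # Build one chunk of n FizzBuzz lines beginning at `start`.
    out = []
    for i in range(start, start + n):
        if i % 15 == 0:
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return ("\n".join(out) + "\n").encode("ascii")

fd = os.open(os.devnull, os.O_WRONLY)

# Time the formatting work on its own...
t0 = time.perf_counter()
chunks = [fizzbuzz_chunk(1 + k * 8192, 8192) for k in range(100)]
t_format = time.perf_counter() - t0

# ...then time only the write() calls.
t0 = time.perf_counter()
for c in chunks:
    os.write(fd, c)
t_write = time.perf_counter() - t0
os.close(fd)

print(f"format: {t_format:.3f}s  write to /dev/null: {t_write:.3f}s")
```

Replacing `os.devnull` with the real stdout pipe (fd 1, piped into `pv`) gives a more honest picture of where the time goes.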

Edit: fixed a few typos and tested on PyPy3 under the same environment:

  • Nearing 100 MiB/s using os.write()
  • Slightly faster 108 MiB/s using print()

Edit 2: moar PyPy!

If we throw CPython performance out of the window, we can go with this pretty straightforward code:

def fizzbuzz(x: int) -> str:
    if x % 15 == 0: return "FizzBuzz"
    if x % 5  == 0: return "Buzz"
    if x % 3  == 0: return "Fizz"
    return f"{x}"

def main():
    c = 1  # start at 1; with c = 0 the very first line would be a spurious "FizzBuzz" for 0
    while True:
        print("\n".join(fizzbuzz(c + i) for i in range(8192)))
        c += 8192

main()

And that is netting about 148 MiB/s on my PC, using PyPy. So about 6x faster than ksousa's CPython version. Running it under python3 gives me 22-24 MiB/s, which is close to where I started. Again, I'm fairly certain that in my environment writing to stdout dominates, and there's very little to be done unless that bottleneck can be addressed.

\$\endgroup\$
2
\$\begingroup\$

Julia 1.7

@inbounds function inc!(l::Vector{UInt8}, n, i = length(l)-1)
    if i > 0
        l[i] += n
        if l[i] > 0x39 # '9' 
            l[i] -= 0x0A # 10
            inc!(l, 0x01, i-1)
        end
    else
        pushfirst!(l, 0x31) # '1'
    end
    
    return nothing
end

function main(maxi = typemax(Int), N = 2^16)

    io = IOBuffer(; sizehint=N)
    a = UInt8['1', '\n']
    sizehint!(a, 100)

    for j in 1:N:maxi

        s = take!(io)
        write(stdout, s)
        l = ll = 0
        
        while ll+l+8 < N
            l = 0
            l += write(io, a)
            inc!(a, 0x01)
            l += write(io, a)
            l += write(io, "Fizz\n")
            inc!(a, 0x02)
            l += write(io, a)
            l += write(io, "Buzz\nFizz\n")
            inc!(a, 0x03)
            l += write(io, a)
            inc!(a, 0x01)
            l += write(io, a)
            l += write(io, "Fizz\nBuzz\n")
            inc!(a, 0x03)
            l += write(io, a)
            l += write(io, "Fizz\n")
            inc!(a, 0x02)
            l += write(io, a)
            inc!(a, 0x01)
            l += write(io, a)
            l += write(io, "FizzBuzz\n")
            inc!(a, 0x02)
            ll += l
        end
    end
end

main()

The buffer size N is tuned for my machine. I get about 350-400 MiB/s in WSL; it might be better on a real Linux machine (let's not talk about the performance on Windows). I got slightly better results with Julia 1.7 than with 1.6.
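The core trick here is the inc! routine: the current number is kept as a vector of ASCII digit bytes (plus the trailing newline) and incremented directly with carry propagation, so no int-to-string conversion happens per line. A sketch of the same idea in Python (my translation, not the author's code):

```python
def inc(digits: bytearray, n: int) -> None:
    """Add n (1..9) to the ASCII decimal number in digits[:-1], in place."""
    i = len(digits) - 2              # last digit; digits[-1] is the b'\n'
    digits[i] += n
    while digits[i] > 0x39:          # went past '9': carry into the next digit
        digits[i] -= 10
        if i == 0:                   # ran out of digits: the number grows
            digits.insert(0, ord("1"))
            return
        i -= 1
        digits[i] += 1

a = bytearray(b"98\n")
inc(a, 1)   # a is now b"99\n"
inc(a, 3)   # a is now b"102\n" (carry ripples and adds a digit)
```

Since FizzBuzz only ever skips ahead by 1, 2, or 3 between printed numbers, a single digit never overflows by more than one carry step, which keeps the loop cheap.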

Try it online!

\$\endgroup\$
4
  • \$\begingroup\$ This can't really be the best Julia can do can it? \$\endgroup\$
    – user108721
    Commented Jun 23, 2022 at 19:52
  • 1
    \$\begingroup\$ @gaffe I agree, but there is an issue with printing apparently discourse.julialang.org/t/why-is-printing-to-a-terminal-slow/…. But I'm sure we can do better regardless \$\endgroup\$
    – MarcMush
    Commented Jun 24, 2022 at 15:40
  • \$\begingroup\$ Do the workarounds at github.com/JuliaLang/julia/issues/43176 help? \$\endgroup\$
    – user108721
    Commented Jun 24, 2022 at 19:56
  • \$\begingroup\$ I didn't try it \$\endgroup\$
    – MarcMush
    Commented Jun 26, 2022 at 0:04
2
\$\begingroup\$

Ruby

# frozen_string_literal: true
FMT = "%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz"
(1..).each_slice(15) do |slice|
  puts format(FMT, slice[0], slice[1], slice[3], slice[6], slice[7], slice[10], slice[12], slice[13])
end

Doesn't seem to be as magical as systems languages, though. Measured on a MacBook Air, 1.7 GHz Dual-Core Intel Core i7, 8 GB 1600 MHz DDR3 (while watching a Twitch stream).

I'm getting from ruby fizzbuzz.rb | pv > /dev/null:

  • [44.4MiB/s] for Ruby 2.7.4, but it seems to fluctuate
  • [39.3MiB/s] for Ruby 3.1-dev, which seems more stable
\$\endgroup\$
5
  • 2
    \$\begingroup\$ Getting 100MiB/s ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-linux] \$\endgroup\$ Commented Dec 9, 2021 at 0:07
  • \$\begingroup\$ I hope this qualifies for the bounty then :) \$\endgroup\$
    – lonelyelk
    Commented Dec 9, 2021 at 13:36
  • \$\begingroup\$ Sorry that bounty did not come from me \$\endgroup\$ Commented Dec 9, 2021 at 14:27
  • \$\begingroup\$ I saw that was not yours. I thought @ksousa wanted dynamic languages. But he rewarded java instead :) \$\endgroup\$
    – lonelyelk
    Commented Dec 10, 2021 at 16:31
  • \$\begingroup\$ @lonelyelk you might be interested in my additions: codegolf.stackexchange.com/a/268828/120701 \$\endgroup\$
    – Dmitry
    Commented Dec 29, 2023 at 3:22
2
\$\begingroup\$

Java: on my 11th-gen Intel laptop

  • The OP's simple C version: 25 MiB/s
  • My simple Java version: 140 MiB/s
  • Python Neil+Kamila: 8 MiB/s
  • Python ksousa chunking: 4.33 MiB/s
  • Jan's C version: 1.0 GiB/s

Sometimes simple is better!

To the guy that did a threaded C++ version: wow!

public class FizzBuzz {

    public static void main(String[] args) {

        long maxBufLen=4096<<3;

        var sb=new StringBuilder((int)maxBufLen + 10);
        for (long i = 1; ; i+=1) {

            if ((i % 3 == 0) && (i % 5 == 0)) {
                sb.append("FizzBuzz\n");
            } else if (i % 3 == 0) {
                sb.append("Fizz\n");
            } else if (i % 5 == 0) {
                sb.append("Buzz\n");
            } else {
                sb.append(i).append("\n");
            }
            
            if(sb.length()>maxBufLen) {
                System.out.print(sb);
                sb.setLength(0);
            }
        }
        
    }

} 

Copied Ioan's threaded version (not the way I'd do it but it works...) and added some efficiencies:

750MiB/s same laptop


import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedList;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Supplier;
import java.util.stream.IntStream;

public class FizzBuzzIoan {

    static final String NEWLINE = "\n"; // String.format("%n") would emit "\r\n" on Windows, which the rules forbid
    static final CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();

    static void fizzBuzz(long src, StringBuilder destination) {
        
        if (src % 15 == 0) {
            destination.append("FizzBuzz"+NEWLINE);
        } else if (src % 3 == 0) {
            destination.append("Fizz"+NEWLINE);
        } else if (src % 5 == 0) {
            destination.append("Buzz"+NEWLINE);
        } else {
            destination.append(src).append(NEWLINE);
        }

    }

    record Pair<T1, T2>(T1 item1, T2 item2) {
    }

    public static void main(String[] args) throws Exception {

        try (var fis= new FileOutputStream(FileDescriptor.out)) {
            var channel = fis.getChannel();
            doIt(channel);
        }
    }

    static void doIt(FileChannel channel) {

        var max=Math.max(ForkJoinPool.getCommonPoolParallelism(), 1);
        
        final Queue<CompletableFuture<Pair<StringBuilder, ByteBuffer>>> queue =
                new LinkedList<>(
                        IntStream.range(0, max).mapToObj(i ->
                            CompletableFuture.completedFuture(
                                    new Pair<>(new StringBuilder(20 * 1024 * 1024), ByteBuffer.allocateDirect(41 * 1024 * 1024)))).toList()
                        );

        CompletableFuture<?> last = CompletableFuture.completedFuture(null);

        final Supplier<Pair<StringBuilder, ByteBuffer>> supplier = () -> {
            try {
                return queue.poll().get();
            } catch (Exception ex) {
                ex.printStackTrace();
                System.exit(1);
                return null;
            }
        };


        var nr = 0L;
        var segment = 1_000_000L;
        

        while (true) {
            final var start = nr;
            final var end = nr + segment;


            final var finalLast = last;

            var cf = CompletableFuture.completedFuture(supplier.get())
                    .thenApplyAsync(p -> {
                        var sb = p.item1();
                        var bb = p.item2();
                        
                        sb.setLength(0);
                        
                        for (long l = start; l < end; l++) {
                            fizzBuzz(l, sb);
                        }
                        bb.clear();
                        
                        try {
                            var cb=CharBuffer.wrap(sb);
                            encoder.encode(cb, bb, true);

                        } catch(Exception ex) {
                            throw new RuntimeException(ex);
                        }
                        return p;
                        
                    })
                    .thenCombineAsync(finalLast, (p, v) -> {
                        try {
                            channel.write(p.item2().flip());
                        } catch (Exception ex) {
                            ex.printStackTrace();
                            System.exit(1);
                        }
                        return p;
                    });

            queue.add(cf);
            last = cf;

            nr = end;


        }

    }

}

\$\endgroup\$
1
  • \$\begingroup\$ Welcome to Code Golf, and nice first answer! \$\endgroup\$ Commented Dec 3, 2021 at 0:07
2
\$\begingroup\$

Can you check my version in Rust? On my laptop it seems faster than @aiden's version.

use nix::unistd::write;
use itoap;

#[inline(always)]
fn u64_to_bytes(n: u64, v: &mut Vec<u8>) {
    itoap::write_to_vec(v, n);
    v.push(0x0a);
}

fn main() {
    let mut i: u64 = 0;

    let Fizz = "Fizz\n".as_bytes();
    let Buzz = "Buzz\n".as_bytes();
    let FizzBuzz = "FizzBuzz\n".as_bytes();

    let mut buffering: Vec<u8> = Vec::with_capacity(1024*1024*2);
    loop {
        for _ in 0..512 {
            i += 1;
            u64_to_bytes(i, &mut buffering);
        
            i += 1;
            u64_to_bytes(i, &mut buffering);

            i += 1;
            buffering.extend_from_slice(&Fizz);

            i += 1;
            u64_to_bytes(i, &mut buffering);

            i += 1;
            buffering.extend_from_slice(&Buzz);

            i += 1;
            buffering.extend_from_slice(&Fizz);

            i += 1;
            u64_to_bytes(i, &mut buffering);

            i += 1;
            u64_to_bytes(i, &mut buffering);

            i += 1;
            buffering.extend_from_slice(&Fizz);

            i += 1;
            buffering.extend_from_slice(&Buzz);

            i += 1;
            u64_to_bytes(i, &mut buffering);

            i += 1;
            buffering.extend_from_slice(&Fizz);

            i += 1;
            u64_to_bytes(i, &mut buffering);

            i += 1;
            u64_to_bytes(i, &mut buffering);

            i += 1;
            buffering.extend_from_slice(&FizzBuzz);            
        }
        write(1, buffering.as_slice());
        buffering.clear();
    }
}
[package]
name = "fizzbuzz"
version = "0.1.0"
edition = "2021"

[dependencies]
nix = "0.24.1"
libc = "*"
itoap = "*"
\$\endgroup\$
7
  • \$\begingroup\$ Not faster than Aiden on my machine, but comes pretty close \$\endgroup\$ Commented Jun 7, 2022 at 21:03
  • \$\begingroup\$ Also it wraps around pretty quickly once it reaches 4 billion since you're using u32, so it's disqualified (see the "astronomical number" rule in the main post). Ping me once you've fixed it and I'll test again \$\endgroup\$ Commented Jun 7, 2022 at 21:22
  • \$\begingroup\$ Hi, I just updated it to change to u64 and add more buffering. It seems that github.com/omertuc/fizzgolf has a bug. When I run fizztester on only a single submission each time it's good, but when I run two submissions at the same time it terminates very early. \$\endgroup\$ Commented Jun 11, 2022 at 8:01
  • \$\begingroup\$ It produces weird result in @aiden bin file. ``` bit@ubuntu-pc:~/fizzgolf$ tail buffer-submissions-rust-aiden-.bin -2147479929 -2147479928 Fizz Buzz -2147479925 Fizz -2147479923 -2147479922 FizzBuzz -2147479920 ``` \$\endgroup\$ Commented Jun 11, 2022 at 8:03
  • \$\begingroup\$ fizz tester works by pitting two submissions against each other, it doesn't know what the correct output should be. All it does is compare. When it notices a mismatch it dumps the general vicinity of where it noticed a difference, you should run a diff tool on the two bin files to see where exactly the submissions differ \$\endgroup\$ Commented Jun 11, 2022 at 12:09
2
\$\begingroup\$

Go

Compile and run with go run main.go

Some notes:

  • Creates output in 15-line templates where possible.
  • Divides work into sections of lines with same base10 integer width, so locations of integers in output buffer can easily be precomputed.
  • Processes work in large chunks of multiple templates, computed in parallel with goroutines and channels.
  • Uses modified versions of standard library itoa for int->string conversion to avoid memory allocation.
  • Caches small (<10000) integer string representations and reuses them where possible.
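The fixed-width template idea from the first two bullets can be sketched compactly. This is my own Python illustration, not the author's Go code: for a run of numbers that all share the same decimal width, one 15-line template with zero-filled placeholders is built once, and then only the eight plain numbers are patched in per block.

```python
def make_template(width):
    """Template for lines n+1..n+15 (n divisible by 15) plus placeholder offsets."""
    parts, idxs, pos = [], [], 0
    for off in range(1, 16):
        if off % 15 == 0:
            s = "FizzBuzz\n"
        elif off % 3 == 0:
            s = "Fizz\n"
        elif off % 5 == 0:
            s = "Buzz\n"
        else:                          # a plain number: remember where it goes
            idxs.append(pos)
            s = "0" * width + "\n"
        parts.append(s)
        pos += len(s)
    return bytearray("".join(parts).encode("ascii")), idxs

def fill(template, idxs, width, n):
    """Patch the eight plain numbers of the block n+1..n+15 into the template."""
    for slot, off in zip((1, 2, 4, 7, 8, 11, 13, 14), idxs):
        template[off:off + width] = str(n + slot).zfill(width).encode("ascii")
    return template

tpl, idxs = make_template(2)
print(fill(tpl, idxs, 2, 15).decode("ascii"), end="")   # lines 16..30
```

Because every placeholder offset is known ahead of time, the Go version goes further and only rewrites the digits that actually change, falling back to the cached small-integer strings for the low digits.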

4.3 GiB/s running this on my desktop

package main

import (
    "fmt"
    "io"
    "os"
    "runtime"
)

const limit = 1 << 61

func main() {
    initItoaCache()
    parallelFizzBuzz(1, limit)
}

const cacheSize = 10000

var logCacheSize = log10(cacheSize)
var itoaCache = make([]string, cacheSize)

func initItoaCache() {
    // precompute string representations
    fmtString := fmt.Sprintf("%%0%dd", logCacheSize)
    for j := 0; j < cacheSize; j++ {
        itoaCache[j] = fmt.Sprintf(fmtString, j)
    }
}

func parallelFizzBuzz(from, to int) {

    for _, wr := range getWidthRanges(from, to) {
        // range which can be filled with templates
        templatesStart := min(wr.to, ceilDiv(wr.from, templateLines)*templateLines)
        templatesEnd := templatesStart + floorDiv(wr.to-templatesStart+1, templateLines)*templateLines

        // handle values before first template
        for i := wr.from; i <= templatesStart; i++ {
            os.Stdout.WriteString(fizzBuzzLine(i))
        }

        // write large chunks in parallel
        const templatesPerJob = 250
        template, placeholderIdxs := fixedWidthTemplate(wr.width)
        nWorkers := runtime.NumCPU()
        chunkSize := nWorkers * templateLines * templatesPerJob

        chunksStart := templatesStart
        chunksEnd := chunksStart + floorDiv(templatesEnd-templatesStart+1, chunkSize)*chunkSize

        if chunksEnd > templatesStart {
            writeParallel(os.Stdout, chunksStart+1, chunksEnd, nWorkers, templatesPerJob, template, wr.width, placeholderIdxs)
        }

        // handle values after last chunk
        for i := chunksEnd + 1; i <= wr.to; i++ {
            os.Stdout.WriteString(fizzBuzzLine(i))
        }
    }
}

type widthRange struct{ from, to, width int }

// getWidthRanges splits integer range [from,to] into disjoint ranges grouped by base10 representation length
func getWidthRanges(from, to int) []widthRange {
    ranges := []widthRange{}

    fromWidth := log10(from + 1)
    toWidth := log10(to + 1)

    for fromWidth < toWidth {
        toVal := pow(10, fromWidth) - 1
        ranges = append(ranges, widthRange{from, toVal, fromWidth})
        from = toVal + 1
        fromWidth += 1
    }

    ranges = append(ranges, widthRange{from, to, fromWidth})
    return ranges
}

// fizzBuzzLine is used to form lines outside of "chunkable" regions
func fizzBuzzLine(i int) string {
    if (i%3 == 0) && (i%5 == 0) {
        return "FizzBuzz\n"
    } else if i%3 == 0 {
        return "Fizz\n"
    } else if i%5 == 0 {
        return "Buzz\n"
    } else {
        return fastItoa(uint64(i)) + "\n"
    }
}

const templateLines = 15

func fixedWidthTemplate(valueWidth int) ([]byte, []int) {
    template := make([]byte, 0, 15+valueWidth*8+4*8)
    formatString := fmt.Sprintf("%%0%dd\n", valueWidth)
    placeholder := []byte(fmt.Sprintf(formatString, 0))
    placeholderIdxs := make([]int, 0, 8)
    fizzBytes := []byte("Fizz\n")
    buzzBytes := []byte("Buzz\n")
    fizzBuzzBytes := []byte("FizzBuzz\n")

    placeholderIdxs = append(placeholderIdxs, len(template))
    template = append(template, placeholder...)
    placeholderIdxs = append(placeholderIdxs, len(template))
    template = append(template, placeholder...)
    template = append(template, fizzBytes...)
    placeholderIdxs = append(placeholderIdxs, len(template))
    template = append(template, placeholder...)
    template = append(template, buzzBytes...)
    template = append(template, fizzBytes...)
    placeholderIdxs = append(placeholderIdxs, len(template))
    template = append(template, placeholder...)
    placeholderIdxs = append(placeholderIdxs, len(template))
    template = append(template, placeholder...)
    template = append(template, fizzBytes...)
    template = append(template, buzzBytes...)
    placeholderIdxs = append(placeholderIdxs, len(template))
    template = append(template, placeholder...)
    template = append(template, fizzBytes...)
    placeholderIdxs = append(placeholderIdxs, len(template))
    template = append(template, placeholder...)
    placeholderIdxs = append(placeholderIdxs, len(template))
    template = append(template, placeholder...)
    template = append(template, fizzBuzzBytes...)
    return template, placeholderIdxs
}

func writeParallel(f io.Writer, firstLine, lastLine, nWorkers, templatesPerJob int, template []byte, width int, placeholderIdxs []int) {

    totalLines := lastLine - firstLine + 1
    workerLines := templateLines * templatesPerJob
    linesPerRound := nWorkers * workerLines
    if totalLines%linesPerRound != 0 {
        panic("uneven work allocation")
    }

    jobChannels := make([]chan int, nWorkers)
    resultChannels := make([]chan []byte, nWorkers)
    totalJobs := ceilDiv(totalLines, workerLines)
    jobsPerWorker := ceilDiv(totalLines, workerLines*nWorkers)

    for i := 0; i < nWorkers; i++ {
        jobChannels[i] = make(chan int, jobsPerWorker*2)
        resultChannels[i] = make(chan []byte, 1)
        go worker(jobChannels[i], resultChannels[i], templatesPerJob, template, width, placeholderIdxs)
    }

    // deal out jobs to workers
    for job := 0; job < totalJobs; job++ {
        jobLine := firstLine + job*workerLines
        jobChannels[job%nWorkers] <- jobLine
    }

    // read buffers from workers
    for job := 0; job < totalJobs; job++ {
        f.Write(<-resultChannels[job%nWorkers])
    }
}

func worker(in <-chan int, out chan<- []byte, templatesPerJob int, template []byte, width int, idxs []int) {
    buffer := make([]byte, len(template)*templatesPerJob)
    buffer2 := make([]byte, len(template)*templatesPerJob)
    buffer3 := make([]byte, len(template)*templatesPerJob)

    for i := 0; i < templatesPerJob; i++ {
        copy(buffer[len(template)*i:], template)
    }
    copy(buffer2, buffer)
    copy(buffer3, buffer)

    for jobLine := range in {

        nextFlush := (jobLine / cacheSize) * cacheSize
        repeat := false // need to fill twice for a clean template for cached ints

        for i := 0; i < templatesPerJob; i++ {
            off := i * len(template)
            if i*templateLines+jobLine+13 > nextFlush || repeat {
                bufFastItoa(buffer, off+idxs[0]+width, uint64(i*templateLines+jobLine))
                bufFastItoa(buffer, off+idxs[1]+width, uint64(i*templateLines+jobLine+1))
                bufFastItoa(buffer, off+idxs[2]+width, uint64(i*templateLines+jobLine+3))
                bufFastItoa(buffer, off+idxs[3]+width, uint64(i*templateLines+jobLine+6))
                bufFastItoa(buffer, off+idxs[4]+width, uint64(i*templateLines+jobLine+7))
                bufFastItoa(buffer, off+idxs[5]+width, uint64(i*templateLines+jobLine+10))
                bufFastItoa(buffer, off+idxs[6]+width, uint64(i*templateLines+jobLine+12))
                bufFastItoa(buffer, off+idxs[7]+width, uint64(i*templateLines+jobLine+13))
                repeat = !repeat
                if repeat {
                    nextFlush += cacheSize
                }
            } else {
                copy(buffer[off:], buffer[off-len(template):off])
                copy(buffer[off+idxs[0]+width-logCacheSize:], itoaCache[(i*templateLines+jobLine)%cacheSize])
                copy(buffer[off+idxs[1]+width-logCacheSize:], itoaCache[(i*templateLines+jobLine+1)%cacheSize])
                copy(buffer[off+idxs[2]+width-logCacheSize:], itoaCache[(i*templateLines+jobLine+3)%cacheSize])
                copy(buffer[off+idxs[3]+width-logCacheSize:], itoaCache[(i*templateLines+jobLine+6)%cacheSize])
                copy(buffer[off+idxs[4]+width-logCacheSize:], itoaCache[(i*templateLines+jobLine+7)%cacheSize])
                copy(buffer[off+idxs[5]+width-logCacheSize:], itoaCache[(i*templateLines+jobLine+10)%cacheSize])
                copy(buffer[off+idxs[6]+width-logCacheSize:], itoaCache[(i*templateLines+jobLine+12)%cacheSize])
                copy(buffer[off+idxs[7]+width-logCacheSize:], itoaCache[(i*templateLines+jobLine+13)%cacheSize])
            }
        }

        out <- buffer
        buffer, buffer2, buffer3 = buffer2, buffer3, buffer
    }

    close(out)
}

//
// Helpers
//

func max(a, b int) int {
    if a < b {
        return b
    }
    return a
}

func min(a, b int) int {
    if a < b {
        return a
    }
    return b
}

func ceilDiv(val, divisor int) int {
    return (val + divisor - 1) / divisor
}

func floorDiv(val, divisor int) int {
    return val / divisor
}

// pow returns n^k
func pow(n, k int) int {
    if k == 0 {
        return 1
    } else if k == 1 {
        return n
    } else {
        return pow(n, k/2) * pow(n, k-k/2)
    }
}

// log10 returns base 10 logarithm for positive n
func log10(n int) int {
    if n <= 0 {
        panic("bad input")
    }
    i := 0
    c := 1
    for c < n {
        c *= 10
        i++
    }
    return i
}

const smallsString = "00010203040506070809" +
    "10111213141516171819" +
    "20212223242526272829" +
    "30313233343536373839" +
    "40414243444546474849" +
    "50515253545556575859" +
    "60616263646566676869" +
    "70717273747576777879" +
    "80818283848586878889" +
    "90919293949596979899"

// bufFastItoa writes base10 string representation of a positive integer to index i in buffer
// adapted from Go source strconv/itoa.go
func bufFastItoa(buf []byte, i int, u uint64) {

    for u >= 100 {
        is := u % 100 * 2
        u /= 100
        i -= 2
        buf[i+1] = smallsString[is+1]
        buf[i+0] = smallsString[is+0]
    }

    // u < 100
    is := u * 2
    i--
    buf[i] = smallsString[is+1]
    if u >= 10 {
        i--
        buf[i] = smallsString[is]
    }
}

// fastItoa returns base10 string representation of a positive integer
// adapted from Go source strconv/itoa.go
func fastItoa(u uint64) string {

    var dst [20]byte
    i := len(dst)

    for u >= 100 {
        is := u % 100 * 2
        u /= 100
        i -= 2
        dst[i+1] = smallsString[is+1]
        dst[i+0] = smallsString[is+0]
    }

    // u < 100
    is := u * 2
    i--
    dst[i] = smallsString[is+1]
    if u >= 10 {
        i--
        dst[i] = smallsString[is]
    }

    return string(dst[i:])
}
\$\endgroup\$
4
  • \$\begingroup\$ I get a few 100 MB/s improvement by removing the panic checks. \$\endgroup\$
    – pxeger
    Commented Jun 20, 2022 at 15:41
  • \$\begingroup\$ More than I would have expected! The templatesPerJob constant might also not be quite optimal. \$\endgroup\$
    – psaikko
    Commented Jun 20, 2022 at 15:45
  • \$\begingroup\$ Looks impressive, but invalid output, jumps from 100004 to 340006 :( \$\endgroup\$ Commented Jun 21, 2022 at 20:13
  • 1
    \$\begingroup\$ @OmerTuchfeld looks like a few last minute bugs slipped in. Edited in some quick fixes for those now. \$\endgroup\$
    – psaikko
    Commented Jun 23, 2022 at 8:07
