20
\$\begingroup\$

We're going rootin' tootin' cow-poke shootin!

This is a simple contest, first program to draw their pistol wins.

How it works:

  • I require 2 things from you, an executable, and a command to be run from a bash terminal.
  • The executable needs to do one thing: wait until it receives 'draw!' from STDIN, and then print 'bang!' as fast as possible, and exit.
  • The executable can do anything else in the meantime, even print to STDOUT, as long as it doesn't print 'bang!', or feed itself from STDIN. It also can't use a standard loophole.
  • Printing 'bang!' early, or exiting without printing 'bang!' is not allowed (you'll be one dead son of a gun).
  • Bash, and other sh derivatives, including sh scripts are not allowed, we don't settle things on native land.

How I measure your boot size:

  • Using the golang runtime, I shall execute your command 1,000 times, the time it takes after feeding 'draw!' into STDIN to print 'bang!' will be your score, within +/- 2 nanoseconds. Your program will be given 500ms(+/-100ms of randomness) to stabilize, and load into memory. The program shall be run on ubuntu 20.04 server with only the necessary components to run your program installed.
  • There will be the following categories: Max time, Min time, Avg time, and Median time, all with lower times being the best.

At the end of the month, the fastest gun in the wild west will be chosen. Best of luck!

\$\endgroup\$
13
  • 4
    \$\begingroup\$ Can I start printing bang! before draw! is inputted as long as I don't finish printing bang! if draw! doesn't show up? \$\endgroup\$
    – hyper-neutrino
    Commented Apr 7, 2021 at 18:41
  • 23
    \$\begingroup\$ Using a garbage-collected language like Go to measure the latency of a program that takes handful of CPU cycles is kind of like using footsteps to estimate the length of one grain of sand… \$\endgroup\$ Commented Apr 7, 2021 at 21:49
  • 18
    \$\begingroup\$ Is it guaranteed that no input will be sent to the program other than draw!, or does the program need to filter out other input and search for draw!? \$\endgroup\$ Commented Apr 8, 2021 at 0:38
  • 8
    \$\begingroup\$ Can I download the runtime and test it on my computer? \$\endgroup\$
    – tsh
    Commented Apr 8, 2021 at 2:33
  • 5
    \$\begingroup\$ 1000 iterations is nowhere near enough if you want 2 ns precision. For measuring I/O bound operations, 1,000,000 iterations would probably be too low. \$\endgroup\$
    – Ray
    Commented Apr 9, 2021 at 17:04

12 Answers 12

12
\$\begingroup\$
section .text

global _start

_start:
  xor r8, r8
  xor r9, r9
  xor r10, r10

_do_while:

_getchar:
  cmp r9, r10

  jne _getchar_done

_getchar_read_buffer:
  xor r9, r9

  xor eax, eax
  xor edi, edi
  mov rsi, in_buf
  mov rdx, inbuf_maxlen

  syscall

  mov r10, rax

_getchar_done:
  mov al, [in_buf + r9]
  inc r9


_do_while_cont:
  cmp al, [req + r8]

  jne _reset

_increment:
  inc r8

  cmp r8, 5

  jne _do_while

  mov rax, 1
  mov rdi, 1
  mov rsi, ans
  mov rdx, 5

  syscall

_end:
  mov rax, 60
  xor rdi, rdi

  syscall

_reset:
  xor r8, r8

  cmp al, 'd'

  jne _do_while

  inc r8

  jmp _do_while

section .rodata
  inbuf_maxlen equ 1000000

  req db "draw!"

ALIGN 8
  ans db "bang!"

section .bss
  in_buf resb inbuf_maxlen

A little hand-written assembly. Compile with:

nasm -felf64 foobar.s && ld foobar.o -o foobar

Edit: move in_bi into a register and use xor.

Edit: used the wrong jump, my code was wrong...

Edit: the previous code had a bug where it would read garbage if the read syscall did not return a full buffer. This version is more efficient (and correct)...

Edit: make sure to process ddraw! correctly. Thanks to Anders Kaseorg for the test case.

Edit: re-structure to move getchar out of a function and to jump less.

Edit: don't xor the high bits of the registers at the suggestion of Cody Gray

Edit: align (thanks to Joshua) and move into .rodata

\$\endgroup\$
4
  • 1
    \$\begingroup\$ oooh! I was curious how long it would take! \$\endgroup\$
    – tuskiomi
    Commented Apr 7, 2021 at 18:56
  • 1
    \$\begingroup\$ @HyperNeutrino told me to write this up lol \$\endgroup\$
    – Riolku
    Commented Apr 7, 2021 at 19:02
  • 3
    \$\begingroup\$ There is never a need to zero a 64-bit register. Eliminate, e.g., xor rax, rax from your assembly-language vernacular (unless you're using it for padding). Zeroing the lower 32 bits implicity zeroes the upper 32 bits. There's no advantage to that superfluous REX prefix. It never helps performance, and might hurt it, since it bloats the code. See also: stackoverflow.com/questions/11177137/… \$\endgroup\$ Commented Apr 8, 2021 at 4:20
  • 1
    \$\begingroup\$ Since program size doesn't matter, put 3 bytes between draw and bang so the write() system call is aligned. \$\endgroup\$
    – Joshua
    Commented Apr 10, 2021 at 23:59
8
\$\begingroup\$

C

#include "stdio.h"

int main() {
  char* a = "draw!";
  int i = 0;
  char c;
  while (1) {
    if ((c = getchar()) == a[i]) {
      i++;
    } else {
      i = c == 'd';
    }
    if (i == 5) {
      puts("bang!");
      break;
    }
  }
}

Compiled with gcc -O3 test.c. You can get a copy of the executable here, but it is probably safer to just copy this code and compile it so you can be sure the executable isn't unsafe (I would never give malicious code online, but it is best to be cautious).

This isn't a very good solution, but it probably works as a baseline for other ideas.

\$\endgroup\$
3
  • 1
    \$\begingroup\$ Alright, the scores are in; max: 764.8us min: 48us avg: 113.8us med: 90.7us \$\endgroup\$
    – tuskiomi
    Commented Apr 7, 2021 at 21:18
  • 2
    \$\begingroup\$ Note though that These high times are likely due to limitations on the OS pipe... I'm going to try and count CPU cycles later. I need more time \$\endgroup\$
    – tuskiomi
    Commented Apr 7, 2021 at 21:19
  • \$\begingroup\$ @tuskiomi Counting CPU cycles will change the nature of the challenge. Most of the time for these will be spent waiting for I/O to complete, not executing operations. \$\endgroup\$
    – Ray
    Commented Apr 10, 2021 at 22:00
7
\$\begingroup\$

Java 11 (Oracle JDK)

Just curious about how it runs.

import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

import static java.nio.charset.StandardCharsets.US_ASCII;

public class Main {

  public static void main(String[] args) throws IOException {
    var stateMachine = new int[6 * 256];
    var response = "bang!".getBytes(US_ASCII);
    Arrays.fill(stateMachine, 5 * 256);
    stateMachine[5 * 256 + 'd'] = stateMachine[4 * 256 + 'd'] = stateMachine[3 * 256 + 'd'] = stateMachine[2 * 256 + 'd'] = stateMachine[1 * 256 + 'd'] = 4 * 256;
    stateMachine[4 * 256 + 'r'] = 3 * 256;
    stateMachine[3 * 256 + 'a'] = 2 * 256;
    stateMachine[2 * 256 + 'w'] = 256;
    stateMachine[256 + '!'] = 0;

    var in = new FileInputStream(FileDescriptor.in);
    var out = new FileOutputStream(FileDescriptor.out);

    for (var state = 5 * 256; state != 0; ) {
      state = stateMachine[state | in.read()];
    }

    out.write(response, 0, 5);
    System.exit(0);
  }
}

Try it online!

Compile to bytecode with

javac Main.java

Run with

java -Xbatch Main

Note the -Xbatch is important as it will further precompile the bytecode to machine code. It's also important to run with Oracle JDK because it has the best machine compiler.

It will reach the read method in around 200-300 ms after start, after taking time to compile to machine code in memory. So I didn't optimize until there. Hence the reason why I let multiplications as is. However, after the first read, no instruction is superfluous.

I used a state machine to remove all the jumps, similar to Bubbler' submission, because if you thought that ifs are slow, wait until you see ifs in Java!

Also I optimized the reading and writing by creating my own non buffered streams (I used new File{Input,Output}Stream(FileDescriptor.{in,out}) instead of the slower, buffered System.{in,out}).

Implementation note: if EOF is found before the expected trigger, it will crash.

But while I hope this code can compete, I wouldn't be surprised at all if it ends last.

\$\endgroup\$
6
\$\begingroup\$

C (gcc)

#include <stdio.h>

int main() {

  FILE* const in = stdin;
  FILE* const out = stdout;
  char c = 0;

  #define is_or(t, l) \
    c = fgetc_unlocked(in); \
    if (__builtin_expect(c != t, 0)) goto l

  x:
  if (__builtin_expect(c == 'd', 1)) goto r;
  d:
  is_or('d', d);
  r:
  is_or('r', x);
  is_or('a', x);
  is_or('w', x);
  is_or('!', x);
  fwrite_unlocked("bang!\n", 1, 6, out);

  return 0;
}

Try it online!

gcc -Ofast bang.c

I have no idea of its performance. Just have a try...

\$\endgroup\$
2
  • \$\begingroup\$ What is the innovation here, as compared to Noodle9's answer? I mean, I see that the code is slightly different, but why would you expect it to run any faster? Because you've provided a branch hint to the optimizer? \$\endgroup\$ Commented Apr 8, 2021 at 4:13
  • \$\begingroup\$ @CodyGray I have no idea yet. But using a string to compare would certainly different from using multiple compares with gotos. \$\endgroup\$
    – tsh
    Commented Apr 8, 2021 at 5:06
6
\$\begingroup\$

C (gcc)

#include <stdio.h>

int transition_table[128*6] = {
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,

    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 0,   640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 512, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,

    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 512, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 128, 640, 640, 640, 640, 640, 640, 640, 640,
    
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 256, 640, 640, 512, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 512, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 384, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 512, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640,
    640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640
};

int main() {

  FILE* const in = stdin;
  FILE* const out = stdout;
  char c = 0;
  int state = 640;
  while(state) {
      c = fgetc_unlocked(in);
      state = transition_table[state + c];
  }
  fwrite_unlocked("bang!\n", 1, 6, out);

  return 0;
}

Try it online!

Throwing a hardcoded state transition table into the mix, in order to eliminate the jumps as much as possible. Honestly I don't know which optimization flag will give the best results. You could try gcc -Ofast bang.c or gcc -O3 bang.c. (maybe -O2 too?)

\$\endgroup\$
2
  • 1
    \$\begingroup\$ I haven't tested (we don't have any information to test any answers), but I expect using short instead of int for the lookup table to be slightly better. \$\endgroup\$ Commented Apr 8, 2021 at 16:36
  • \$\begingroup\$ I like this one. \$\endgroup\$
    – Tony Ennis
    Commented Apr 8, 2021 at 20:46
6
\$\begingroup\$

Rust

use std::hint::unreachable_unchecked;
use std::io::{stdin, stdout, Read, Write};
pub fn main() {
    let key = b"draw!";
    let temp = stdin();
    let mut stdin = temp.lock();
    let tmp = stdout();
    let mut out = tmp.lock();
    let mut i = 0;
    let mut buf = [0];
    loop {
        stdin
            .read(&mut buf)
            .unwrap_or_else(|_| unsafe { unreachable_unchecked() });
        let read = *unsafe { buf.get_unchecked(0) };
        if unsafe { read == *key.get_unchecked(i) } {
            i += 1;
        } else {
            i = (read == b'd') as usize;
        }
        if i == 5 {
            out.write(b"bang!")
                .unwrap_or_else(|_| unsafe { unreachable_unchecked() });
            out.flush()
                .unwrap_or_else(|_| unsafe { unreachable_unchecked() });
            break;
        }
    }
}

Try it online!

A solution in rust. It is pretty similar to @hyper-neutrino's solution, but it makes sure to flush the output buffer after writing and also asserts that io errors can't happen (it will be undefined behavior if they do, probably causing a sigill). Build rustc -Copt-level=3 -Clto=fat -Ctarget-cpu=native -o main and run ./main

\$\endgroup\$
5
\$\begingroup\$

C

#include "stdio.h"

int main() {
  int i = 0;
  while (1) {
    i ++;
  int c = getchar();
    switch (i) {
      case 1:
        if (c != 100) i = 0;
        continue;
      case 2:
        switch (c) {
          case 100:
            i = 1;
          case 114:
            continue;
          default:
            i = 0;
            continue;
        }
      case 3:
        switch (c) {
          case 100:
            i = 1;
          case 97:
            continue;
          default:
            i = 0;
            continue;
        }
      case 4:
        switch (c) {
          case 100:
            i = 1;
          case 119:
            continue;
          default:
            i = 0;
            continue;
        }
      case 5:
        switch (c) {
          case 100:
            i = 1;
          case 33:
            continue;
          default:
            i = 0;
            continue;
        }
      default:
        puts("bang!");
    }
    break;
  }
}

Compiled with gcc -O3 -o foobar foobar.c, run with ./foobar (I'm on Windows, so the -o and ./foobar is a guess).

A variation on hyper-neutrino's solution. Uses a switch statement instead of indexing into a string. Hopefully, this makes it faster.

\$\endgroup\$
4
  • \$\begingroup\$ Same command to run and compile? \$\endgroup\$
    – tuskiomi
    Commented Apr 7, 2021 at 18:49
  • \$\begingroup\$ @tuskiomi Sorry, just added it. \$\endgroup\$
    – user
    Commented Apr 7, 2021 at 18:49
  • 2
    \$\begingroup\$ This certainly generates more object code than hyper-neutrino's solution. Unfortunately, it's code that is filled with branches. It's hard to see how all those branches would improve the execution speed. I would predict the opposite. \$\endgroup\$ Commented Apr 8, 2021 at 4:17
  • \$\begingroup\$ @CodyGray Oh, I thought using a lookup table would be faster. It's probably too late to change it substantially now anyway. \$\endgroup\$
    – user
    Commented Apr 8, 2021 at 19:40
4
\$\begingroup\$

C (clang), 343 bytes

#include <stdio.h>

int main()
{
    const char draw[] = "draw!";
    size_t i = 0;
    while (i < 5)
    {
        char c = getchar_unlocked();
        i = c==draw[i] ? i + 1 : c=='d';
    }
    putchar_unlocked('b');
    putchar_unlocked('a');
    putchar_unlocked('n');
    putchar_unlocked('g');
    putchar_unlocked('!');

    return 0;
}

Try it online!

Compiled with -Ofast.

\$\endgroup\$
2
  • 2
    \$\begingroup\$ Why do 5 separate putchar calls, instead of writing the entire string at once (e.g., puts_unlocked)? Seems like reducing the number of system calls would improve performance. \$\endgroup\$ Commented Apr 8, 2021 at 4:11
  • 2
    \$\begingroup\$ @CodyGray putchar_unlocked doesn’t actually make the system call; it buffers in the C library. \$\endgroup\$ Commented Apr 8, 2021 at 5:37
3
\$\begingroup\$

C++ (gcc)

Probably not the fastest, but wanted to give it a try:

#include <iostream>

std::string a;

int main(){
    std::cin >> a;
    if(a == "draw!") std::cout << "bang!";
}

Compiled using g++ -Ofast bang.cpp.

\$\endgroup\$
1
1
\$\begingroup\$

Python:

from sys import stdin
def fgitw(func=stdin.readline, it=iter):
    for _ in it(func, 'draw!'):
        pass
    print("bang!")
fgitw()

A smaller version will be simply(but sys.stdin.readline with local variables above should be faster):

for _ in iter(input, 'draw!'):
    pass
print("bang!")
\$\endgroup\$
1
\$\begingroup\$

C (GCC, Linux x86-64)

I have implemented the following ideas naively hoping for performance without testing:

  • the translation draw! to bang! happens via a lookup table, i.e. without any jumps
  • the write to stdout happens via a syscall to Linux' write (without any indirections of calling glibc's write wrapper)
  • the program exits via a syscall to Linux' exit (without any indirections of calling glibc's exit wrapper)

You can see the generated asm on godbolt: https://godbolt.org/z/Ex3T5G5Gq. Compile with gcc -O3 -march=x86-64 -fomit-frame-pointer draw-bang.c -o draw-bang.

#include <stdio.h>
#include <unistd.h>
#include <asm/unistd.h>

static inline ssize_t my_write(int, const void*, size_t);
static inline void __attribute__((noreturn)) my_sys_exit(void);

/**
 * table[1 -- 128] translates ASCII characters with codepoint 0 -- 127
 * to other ASCII characters such that
 * 1. `draw!` translates to `bang!` and
 * 2. any other character translates to `\0`
 * 
 * table[0] is `\0`.
 * 
 * Rationale: `table + 1` can be indexed by the return value of `getc(stdin)`.
 *             And upon error or EOF, we nicely get `table[0]`, i.e. `\0`.
 */
static char table[129] = {
    '\0', // -1-th index
    // 128 entries follow
    '\0','\0','\0','\0','\0','\0','\0','\0','\0','\0', //   0 --   9
    '\0','\0','\0','\0','\0','\0','\0','\0','\0','\0', //  10 --  19
    '\0','\0','\0','\0','\0','\0','\0','\0','\0','\0', //  20 --  29
    '\0','\0','\0', '!','\0','\0','\0','\0','\0','\0', //  30 --  39
    '\0','\0','\0','\0','\0','\0','\0','\0','\0','\0', //  40 --  49
    '\0','\0','\0','\0','\0','\0','\0','\0','\0','\0', //  50 --  59
    '\0','\0','\0','\0','\0','\0','\0','\0','\0','\0', //  60 --  69
    '\0','\0','\0','\0','\0','\0','\0','\0','\0','\0', //  70 --  79
    '\0','\0','\0','\0','\0','\0','\0','\0','\0','\0', //  80 --  89
    '\0','\0','\0','\0','\0','\0','\0', 'n','\0','\0', //  90 --  99
     'b','\0','\0','\0','\0','\0','\0','\0','\0','\0', // 100 -- 109
    '\0','\0','\0','\0', 'a','\0','\0','\0','\0','g',  // 110 -- 119
    '\0','\0','\0','\0','\0','\0','\0','\0'            // 120 -- 127
};
static char out[6]; // to hold "bang!\0"

int main(void) {
    char c, i;
    while ((c = (table + 1)[getc(stdin)])) {
        out[i++] = c;
    }
    my_write(STDOUT_FILENO, out, sizeof(out));
    my_sys_exit();
}

/**
  * "Reimplementation" of write (2) function that GCC is happy to fully inline.
  *
  * Source: <https://stackoverflow.com/a/9508738/603003>
  * Author: Daniel Kamil Kozar <https://stackoverflow.com/users/1155000/daniel-kamil-kozar> and editors
  * License: CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0/>
  */
static inline ssize_t my_write(int fd, const void *buf, size_t size) {
    ssize_t ret;
    asm volatile
    (
        "syscall"
        : "=a" (ret)
        //                 EDI      RSI       RDX
        : "0"(__NR_write), "D"(fd), "S"(buf), "d"(size)
        : "rcx", "r11", "memory"
    );
    return ret;
}

/**
 * Exit the current thread (process for single-threaded processes) using the `exit` Linux
 * system call.
 * 
 * This translates to exactly one asm instruction as opposed to calling the usual standard
 * libraries's exit() family of functions: see <https://stackoverflow.com/a/46903734/603003>.
 */
static inline void __attribute__((noreturn)) my_sys_exit(void) {
    asm("syscall" : : "r"(__NR_exit));
}
```
\$\endgroup\$
0
\$\begingroup\$

C++ (clang, Linux x86-64)

#include <bit>
#include <cstdint>
#include <tuple>

#include <unistd.h>
#include <asm/unistd.h>

static_assert(std::endian::native == std::endian::little);

using U64 = std::uint64_t;

inline void sys_write(int fd, void const* buf, std::size_t count)
{
    asm volatile("syscall"
        :
        : "a"(__NR_write), "D"(fd), "S"(buf), "d"(count)
        : "rcx", "r11", "memory");
}

inline ssize_t sys_read(int fd, void* buf, std::size_t count)
{
    ssize_t r;
    asm volatile("syscall"
        : "=a"(r)
        : "0"(__NR_read), "D"(fd), "S"(buf), "d"(count)
        : "rcx", "r11", "memory");
    return r;
}

U64 const DRAW = UINT64_C(0x2177617264);
U64 const MASK = UINT64_C(0xFFFFFFFFFF);

inline bool contains_it(U64 x)
{
    for(auto [pattern, mask] : {
        std::tuple{DRAW <<  0, MASK <<  0},
        std::tuple{DRAW <<  8, MASK <<  8},
        std::tuple{DRAW << 16, MASK << 16},
        std::tuple{DRAW << 24, MASK << 24}
    })
    {
        if((x & mask) == pattern) [[unlikely]]
        {
            return true;
        }
    }
    return false;
}

inline bool contains_it(U64 const* first, U64 const* last_incl)
{
    U64 prev = 0;
    for(auto p = first; p <= last_incl; ++p)
    {
        if(contains_it(*p) || contains_it((*p << 32) | (prev >> 32))) [[unlikely]]
        {
            return true;
        }
        prev = *p;
    }
    return false;
}

int main()
{
    std::uintptr_t const MAX_READ = 1'048'576;
    std::size_t const S = sizeof(U64);

    U64 u64s[1 + MAX_READ / S + 1] = {};

    int r = MAX_READ;

    while(1)
    {
        int rem = r % S;
        U64* end = u64s + 1 + r / S;
        *end &= ~(U64(-1) << (rem * 8));

        if(contains_it(u64s, end))
        {
            sys_write(STDOUT_FILENO, "bang!", 5);
            return 0;
        }

        u64s[0] = end[-1];
        u64s[1] = end[0];

        r = sys_read(STDIN_FILENO, reinterpret_cast<char *>(u64s + 1) + rem, MAX_READ) + rem;
    }
}

Compile with clang++-12 -std=c++20 -Ofast -march=native -o drawbang drawbang.cpp.

Clang’s optimizer is scarily efficient in how it vectorizes this code — godbolt. The entirety of the main loop is under .LBB0_4, where it takes as few as 16 instructions to go through 8 bytes. Some of those are conditional jumps, which shouldn’t matter as the CPU should predict the outcomes correctly, for which purpose the code starts by going through a megabyte of zeroes.

Like other submissions, the code assumes it will be fed junk that it has to ignore until it encounters the key string.

\$\endgroup\$

Not the answer you're looking for? Browse other questions tagged or ask your own question.