Real-world usage of null-terminated strings
The only reason we still care about null-terminated strings is the widespread use of C strings and all the APIs built around them. It's simple to write basic C or assembly functions that process a string byte by byte and stop at a null byte:
while (*s) {
putchar(*s);
++s;
}
Example compiled assembly from GCC -Os, with some function boilerplate omitted since I wrote it as a function:
# s starts in rdi
mov rbx, rdi #
L: movsx edi, BYTE PTR [rbx] # while (*s)
test dil, dil #
je done # {
call putchar # putchar(*s)
inc rbx # ++s
jmp L # }
...but not that much simpler than comparing s to s+len (GCC outsmarted me and saved an instruction in the loop, because I just counted down len):
char* end = s + len;
while (s != end) {
putchar(*s);
++s;
}
# s starts in rdi, len starts in rsi
mov rbx, rdi #
lea rbp, [rdi+rsi] # end = s + len
L: cmp rbx, rbp # while (s != end)
je done # {
movsx edi, BYTE PTR [rbx] # arg1 = *s
inc rbx # ++s (gcc does this here for some reason)
call putchar # putchar(arg1)
jmp L # }
Vulnerabilities
The massive downside is that a null-terminated string requires the string to have a null byte at the end AND not earlier. The vulnerabilities are numerous and very costly. Can you spot the problem in each example?
#define MAXLEN 1024
...
char *pathbuf[MAXLEN];
...
read(cfgfile,inputbuf,MAXLEN);
strcpy(pathbuf,inputbuf);
...
char *foo;
int counter;
foo=calloc(10, sizeof(char));
for (counter=0;counter!=10;counter++) {
foo[counter]='a';
printf("%s\n",foo);
}
char *foo;
foo=malloc(sizeof(char)*5);
foo[0]='a';
foo[1]='a';
foo[2]=0; // example doesn't compile here so I simplified it
foo[3]='c';
foo[4]='\0'; // example has missing ; here
printf("%c %c %c %c %c \n",foo[0],foo[1],foo[2],foo[3],foo[4]);
printf("%s\n",foo);
char firstname[20];
char lastname[20];
char fullname[40];
fullname[0] = '\0';
strncat(fullname, firstname, 20);
strncat(fullname, lastname, 20);
The strn* functions aren't always safe. strncat(dest, src, count) always puts a null character at the end, so it can write up to count+1 bytes! But strncpy(dest, src, count) won't null-terminate the result if count is reached before the entire src is copied.
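One possible way to avoid both pitfalls in the fullname example is to pass the full destination size to snprintf, which always null-terminates when the size is non-zero and reports how long the complete result would have been. This is only a sketch: join_names is a hypothetical helper, not part of any library, and it assumes both inputs are already properly null-terminated.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes first followed by last into dest.
 * Never writes more than dest_size bytes, terminator included.
 * Returns 1 if everything fit, 0 if the result was truncated. */
int join_names(char *dest, size_t dest_size,
               const char *first, const char *last)
{
    assert(dest != NULL && dest_size > 0);
    assert(first != NULL && last != NULL);

    /* snprintf returns the length the untruncated result would have,
     * so comparing it to dest_size detects truncation. */
    int needed = snprintf(dest, dest_size, "%s%s", first, last);
    return needed >= 0 && (size_t)needed < dest_size;
}
```

A caller would write join_names(fullname, sizeof fullname, firstname, lastname) and treat a return value of 0 as an error instead of silently using a cut-off name.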
Length-based strings
All mainstream languages since C that have a string object have used length-based strings, because C strings are so painful and error-prone. A length-based string is easy to implement too (except in assembly, which has no structs). The most basic example is something like:
struct string {
size_t length;
char *text;
};
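To show what this struct buys you, here is a rough sketch of a few operations on it. The helper names (string_from_cstr, string_slice, string_equal, string_print) are made up for illustration. Note that slicing needs no copy and no terminator, and comparison works even if the text contains zero bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct string {
    size_t length;
    char *text;
};

/* Hypothetical helper: wrap an existing C string without copying it. */
struct string string_from_cstr(char *s)
{
    assert(s != NULL);
    struct string str = { strlen(s), s };
    return str;
}

/* Hypothetical helper: a view of count bytes starting at start.
 * No copy is made and no terminator is written. */
struct string string_slice(struct string s, size_t start, size_t count)
{
    assert(start <= s.length && count <= s.length - start);
    struct string sub = { count, s.text + start };
    return sub;
}

/* Compare lengths first, then contents; embedded zero bytes are fine. */
int string_equal(struct string a, struct string b)
{
    return a.length == b.length && memcmp(a.text, b.text, a.length) == 0;
}

/* Write exactly length bytes; returns the number of bytes written. */
size_t string_print(struct string s, FILE *out)
{
    return fwrite(s.text, 1, s.length, out);
}
```

The slice shares memory with the original, so it is only valid while the original buffer is alive and unchanged.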
A size_t length might take 4 or 8 bytes instead of 1 null-terminating byte, but many non-trivial string tasks require keeping track of the length of the string or buffer separately anyway, if you don't want to waste time re-computing it with strlen every time. Plus, if those extra bytes really matter, you could use an initial length byte limited to 255 and support "short strings" with the same space overhead of a single byte, or a 2-byte length that can support a very reasonable 65535 characters at the cost of only 1 extra byte. From the Wikipedia article:
FreeBSD developer Poul-Henning Kamp, writing in ACM Queue, referred to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.
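The one-byte length prefix described above could look roughly like this. The layout and the short_string_* names are assumptions for the sake of illustration: byte 0 holds the length and the text follows immediately after it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: s[0] is the length, s[1..length] is the text.
 * The overhead is one byte, the same as a null terminator. */
#define SHORT_STRING_MAX 255

/* Encodes len bytes of src into dest, which must have room for
 * len + 1 bytes. Returns 1 on success, 0 if len does not fit in a byte. */
int short_string_encode(uint8_t *dest, const char *src, size_t len)
{
    assert(dest != NULL && src != NULL);
    if (len > SHORT_STRING_MAX)
        return 0;
    dest[0] = (uint8_t)len;
    memcpy(dest + 1, src, len);
    return 1;
}

/* The length is a single load instead of a scan for a terminator. */
size_t short_string_length(const uint8_t *s)
{
    return s[0];
}

const char *short_string_text(const uint8_t *s)
{
    return (const char *)(s + 1);
}
```

The obvious trade-off is the hard 255-byte limit, which the encoder has to reject explicitly.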
A dynamically-sized string or array could also track a size_t capacity for re-allocating when growing or shrinking.
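As a minimal sketch of how a capacity field might be used, the append below doubles the capacity when the text no longer fits. The struct dynstring type, the function names, the starting capacity of 16 and the doubling policy are all assumptions chosen for the example, not a recommendation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct dynstring {
    size_t length;    /* bytes currently in use */
    size_t capacity;  /* bytes allocated for text */
    char *text;
};

/* Appends n bytes of data. Returns 1 on success, 0 on overflow or
 * allocation failure, in which case the string is left unchanged. */
int dynstring_append(struct dynstring *s, const char *data, size_t n)
{
    assert(s != NULL && (data != NULL || n == 0));
    if (n > SIZE_MAX - s->length)
        return 0;

    size_t needed = s->length + n;
    if (needed > s->capacity) {
        size_t new_cap = s->capacity ? s->capacity : 16;
        while (new_cap < needed) {
            if (new_cap > SIZE_MAX / 2) {  /* doubling would overflow */
                new_cap = needed;
                break;
            }
            new_cap *= 2;
        }
        char *p = realloc(s->text, new_cap);
        if (p == NULL)
            return 0;
        s->text = p;
        s->capacity = new_cap;
    }
    if (n > 0)
        memcpy(s->text + s->length, data, n);
    s->length = needed;
    return 1;
}

void dynstring_free(struct dynstring *s)
{
    assert(s != NULL);
    free(s->text);
    s->text = NULL;
    s->length = 0;
    s->capacity = 0;
}
```

Starting from a zeroed struct ({0, 0, NULL}) works because realloc on a NULL pointer behaves like malloc.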
SDS strings are one implementation that combines a C string with a length prefix stored before the actual pointer and a null terminator at the end.
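The general idea of hiding a header in front of the pointer can be sketched like this. To be clear, this is not SDS's actual API or header format; struct prefix_header and the prefix_* functions are hypothetical names, and the header here only stores a length.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the text. */
struct prefix_header {
    size_t length;
};

/* Allocates header + text + terminator in one block and returns a
 * pointer to the text, or NULL on failure. */
char *prefix_new(const char *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (len > SIZE_MAX - sizeof(struct prefix_header) - 1)
        return NULL;

    struct prefix_header *h = malloc(sizeof *h + len + 1);
    if (h == NULL)
        return NULL;
    h->length = len;

    char *text = (char *)(h + 1);
    if (len > 0)
        memcpy(text, data, len);
    text[len] = '\0';  /* keeps the result usable with C string APIs */
    return text;
}

/* Step back from the text pointer to read the stored length. */
size_t prefix_length(const char *text)
{
    const struct prefix_header *h = (const struct prefix_header *)text - 1;
    return h->length;
}

/* The block must be freed from the header, not from the text pointer. */
void prefix_free(char *text)
{
    if (text != NULL)
        free((struct prefix_header *)text - 1);
}
```

Because the returned pointer is an ordinary char *, it can be handed to printf or strlen directly, while prefix_length still reports the true length even when the text contains zero bytes. The catch is that these functions must only ever be given pointers that came from prefix_new.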