20
$\begingroup$

There are two main ways to implement strings in a language: null-terminated and length-prefixed strings.

From Wikipedia:

A null-terminated string is a character string stored as an array containing the characters and terminated with a null character, often this is the null character (NUL), which has all bits zero.

and:

Length-prefixed

The length of a string can also be stored explicitly, for example by prefixing the string with the length as a byte value.

For example, the string "Hello, World!", when using null-terminated strings (using ASCII), is stored as:

48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21 00
                                       ^^
                                       the NUL character

and using length-prefixed string:

0D 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21
^^
the length byte

These approach is vastly different, so what are the advantages of advantages/disadvantages of using one instead of another?

$\endgroup$
4
  • 2
    $\begingroup$ how is this really a language-related question as opposed to a computer science question? Because you mention "implementation" in the question but not "design", so I'm not sure as someone considering answering this question if you care about language design questions like usability- aside from which I'm not really sure if there's more to say than is just about the computer-science face of the question. $\endgroup$
    – starball
    Commented May 21, 2023 at 22:06
  • $\begingroup$ See also: stackoverflow.com/questions/25068903/what-are-pascal-strings $\endgroup$
    – oɔɯǝɹ
    Commented Aug 15, 2023 at 21:36
  • $\begingroup$ A common trick for ASCII strings (pre-unicode) was to use the 8th bit in the byte to mark the end of the string -- the last character would have it set (or clear) and all the others would have it clear (or set). $\endgroup$
    – Chris Dodd
    Commented Oct 28, 2023 at 9:38
  • 1
    $\begingroup$ You might be entertained by my very first blog post from 2003 which was on this subject. ericlippert.com/2003/09/12/… $\endgroup$ Commented Jul 17 at 18:06

12 Answers 12

13
$\begingroup$

An option missing from the question is fat pointers ─ the type &str in Rust is an example of this. The length is not stored on the heap as a prefix to the string data, instead it is stored alongside the pointer, so that a reference to a string takes two words (length and pointer) instead of just one for a pointer.

This means that if there are multiple references to the same string, then the length data is duplicated compared to a length-prefixed string, which would only store the length once, where the string data is. But the upside is that a fat pointer can reference a substring without duplicating the string data on the heap.

Rust String vs. &str memory diagram

In the diagram above (from the official Rust book), s is a String so it has a fat pointer to the whole string allocation (plus a capacity field, since it's a growable string), while world is a shared reference (i.e. a fat pointer) to a substring. This sharing would not be possible with length-prefixing, and would be possible with null-termination for substrings at the end of the string but not otherwise.

$\endgroup$
5
  • 6
    $\begingroup$ Fat pointers are orthogonal to the question and are just a way to implement the length-based variant. $\endgroup$
    – feldentm
    Commented Jul 17, 2023 at 18:12
  • 2
    $\begingroup$ C++ strings (and string_views) follow this model, too (in most implementations). $\endgroup$ Commented Jul 18, 2023 at 8:27
  • $\begingroup$ This schema has a drawback when you want to modify the contents of the string, since you cannot change (nor free) those bytes. So you would have to use immutable strings and implement some kind of garbage collection... $\endgroup$
    – Ángel
    Commented Jul 20 at 1:32
  • $\begingroup$ Immutable strings can have Copy on Write $\endgroup$
    – qwr
    Commented yesterday
  • $\begingroup$ @qwr: In the absence of a tracing collector, code which wants to replace a reference to a possible-shared immutable string with a reference to a different string would need to do an synchronized decrement of the reference count and ensure that eliminating the last reference to a string would either free the original string or (if it had the desired length) recycle it into a non-sharable mutable string instead of creating a new object. $\endgroup$
    – supercat
    Commented yesterday
12
$\begingroup$

Length-prefixed strings have the advantage of being able to find their length in O(1) time rather than O(n) time. This means you can find the end of the string more easily with the length prefix. They are also less error prone to use since you don't have to deal with forgetting to null terminate a string.

One disadvantage to length prefixed strings is that they require more space. In addition, you are limited in what the max size of the string can be based on how many bytes are used to store the length.

$\endgroup$
5
  • 6
    $\begingroup$ Strings which are prefixed by a variable-length prefix would require the same amount of space as zero-terminated strings when they're short enough to encode the length in a single byte. One could encapsulate both buffer size and current length with a single byte of overhead [same as zero-termination] for strings or buffers up to 63 bytes, two bytes for strings up to 4095 bytes, and five bytes for strings up to 2GiB, while leaving a few prefix byte values available for other purposes. $\endgroup$
    – supercat
    Commented Jul 19, 2023 at 17:40
  • $\begingroup$ @supercat another way to do it is the ASN.1/DER scheme. $\endgroup$ Commented Nov 12, 2023 at 5:54
  • 1
    $\begingroup$ @KarlKnechtel: The scheme I was describing would allow the length of a string to be dynamically changed to any value less than the length of the buffer. Effectively, one bit in the prefix says whether the buffer is full or not, and if the buffer isn't full then either there will a byte at the end to record that it's one byte short of full, two bytes to report that it's 1-255 bytes short of being full, or lots of bytes to report any amount of space short of being full. $\endgroup$
    – supercat
    Commented Nov 13, 2023 at 15:53
  • $\begingroup$ Yes, I understood that. I'm only saying that there are pre-existing schemes that do the same thing and might be considered more standardized. $\endgroup$ Commented Nov 13, 2023 at 20:57
  • $\begingroup$ @KarlKnechtel: I am unaware of any standard for any method that encapsulates both the size of the a buffer and its present length using only one more bit (or limiting the length range by a factor of two) compared with storing the size alone. $\endgroup$
    – supercat
    Commented yesterday
12
$\begingroup$

Null-terminated strings can't include the null byte, whereas length-prefixed ones can include all 256 possible binary values.

$\endgroup$
16
  • 2
    $\begingroup$ We're talking character strings here, not binary strings, so there's more than 256 possible values. Java, when exporting strings to C, will translate the null byte to C0 80 which is technically invalid in UTF-8 but which would decode to the null byte. $\endgroup$
    – prosfilaes
    Commented May 21, 2023 at 23:01
  • 1
    $\begingroup$ There's always modified UTF-8 but that way lies madness. $\endgroup$ Commented Jul 17, 2023 at 21:23
  • 1
    $\begingroup$ @prosfilaes: Many programs exist for the purpose of processing text which will be primarily processed by other programs, rather than humans. As such, there's no need to worry about variable-length code points and other gratuitous complications. $\endgroup$
    – supercat
    Commented Jul 18, 2023 at 21:46
  • 3
    $\begingroup$ @prosfilaes: More generally, a language that can work with arbitrary binary blobs will be able to, with proper application code, work with all present and future methods of encoding information as a sequence of octets By contrast, a language which tries to add special semantics to certain byte sequences will be unable to work with information encoded in ways for which such treatment would be inappropriate. $\endgroup$
    – supercat
    Commented Jul 19, 2023 at 17:00
  • 1
    $\begingroup$ @prosfilaes: Embedded devices vastly outnumber PCs and phones. With the exception of code whose purpose is to perform Unicode operations as a sequence of byte manipulations, most tasks that would involve Unicode text handling and all the corner cases associated therewith could be better accomplished in languages other than C. $\endgroup$
    – supercat
    Commented Jul 19, 2023 at 17:47
10
$\begingroup$

Length-prefixed can include capacity

This lets you "over-allocate" storage and remember how much you did so:

0D 10 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21 00 00 00
^  ^  ^                                      ^
|  |  data...                                over-allocation
|  capacity
length

This is particularly useful for languages with mutable strings, as it allows append operations to be amortized O(1).

$\endgroup$
4
  • 2
    $\begingroup$ If you're just going to say "add a prefixing capacity" field, doesn't that apply equally well for null-terminated strings? who says you can't have a capacity prefix for a null-terminated string? $\endgroup$
    – starball
    Commented May 21, 2023 at 21:50
  • $\begingroup$ I suppose it's possible, but I don't know of any widespread implementations that prefix capacity but not length. $\endgroup$
    – Bbrk24
    Commented May 21, 2023 at 21:53
  • 3
    $\begingroup$ @starball You cannot do append operations in amortized O(1) on null-teminated strings with a capacity prefix because you still need to do find the end of the string, which is O(n). $\endgroup$
    – wastl
    Commented Jul 18, 2023 at 5:20
  • $\begingroup$ @wastl: One could do so if one specified that the last byte of the allocation must either be a zero if the string length is the allocation length minus one, a value 1-127 to indicate that there are 1-127 unused bytes following the null terminator, or a value 256-N to indicate that the preceding N bytes of storage hold the length of the unused portion of the storage, in little-endian format. $\endgroup$
    – supercat
    Commented Jul 18, 2023 at 21:49
8
$\begingroup$

Length-based strings have two non-obvious performance advantages:

Concatenating lots of them into a big chunk is faster as you can preallocate the correct buffer size by just looking at the lengths.

Furthermore, comparing lots of strings with different lengths but common prefixes is faster. An example of this is a set of fully qualified names like java.util.Set, java.util.HashSet, java.util.Collection, etc.

The non-obvious advantage of zero-terminated strings is that one doesn't have to think up front about the length of an index type, i.e. setting the length to e.g. 32bit won't create a compatibility issue someday.

$\endgroup$
1
  • 3
    $\begingroup$ On the flip side, zero-terminated strings will have a performance cost which will increase sharply as their length becomes significant relative to the CPU cache size--something that would happen long before a 32-bit length becomes a problem. $\endgroup$
    – supercat
    Commented Aug 17, 2023 at 17:26
6
$\begingroup$

Manipulation

Manipulating length-prefixed strings is a lot easier than manipulating null-terminated ones. For example, when you concatenate strings, you just have to add the lengths, while for the null-terminated case you have the extra null to remove. When splitting strings, you have to not forget to include the null. For length-prefixed strings, you just take the length of the array and halve it.

$\endgroup$
5
  • $\begingroup$ okay, but when splitting strings, who says you can do it without moving data? If you give more space to fit the length prefix than for each character, then don't you have to move data? Then it's not just a usability cost- it's a performance cost... $\endgroup$
    – starball
    Commented May 21, 2023 at 21:49
  • $\begingroup$ @starball i never said you can do it without moving data. $\endgroup$
    – Seggan
    Commented Nov 6, 2023 at 20:11
  • $\begingroup$ @starball: If code which expects to work with a string first calls a function to convert a string pointer into text address and length, and if special prefix values are reserved for data records that, instead of holding a string directly, hold the address and length of the actual string content, then one could pass arbitrary substrings to code which expects to receive a single "string pointer", without having to copy any string data (construct a record containing a header, the address of the start of the desired substring, and the length of the desired substrign). $\endgroup$
    – supercat
    Commented yesterday
  • $\begingroup$ @supercat sure. that's my understanding of what std::stringview is. but in the context o this question, I think it's debatable whether that counts as being the string itself. std::stringview, for example, is a non-reference-counted, non-owning view type. $\endgroup$
    – starball
    Commented yesterday
  • $\begingroup$ @starball: Can code easily accept pointers to std::string and std::stringview interchangeably? What I'd advocate would essentially be having a constructors for "readable string descriptor" and "mutable string descriptor" structures which could accept pointers to descriptors or minimal-storage-overhead in-place strings interchangeably. $\endgroup$
    – supercat
    Commented yesterday
4
$\begingroup$

Null-terminated strings seem to have many disadvantages, but the biggest problem is you usually have to also store a size somewhere else, to avoid vulnerabilities, making it seemingly redundant.

But there is an unexpected advantage to also have the null terminator, that is in simple routines such as scanning a string starting from a given point for a complete grapheme cluster, you'll only need to pass one pointer, without the start of the string or the size.

Similar situations are common in any case you use characters like "commands", say escapes in quotes and regexes, and printf and datetime formats. If you have a unified format for all the escapes, it won't have any problems. But if something has to be escaped in multiple layers, or the same thing has different meaning in different layers, the code quickly becomes ugly, and someone will want to choose a raw format in all but one layer instead. Strings and regexes usually doesn't have escapes with conflicting meaning, but the designer may not think making exceptions for regexes a good idea, and may not come up with a general interface for adding escapes or making escaped quotes and the escape characters optionally kept in the escaped format. But they could be improved if the designer does.

Null-terminated strings are a bit worse than escapes, but if we are not considering the buffer size part, they mix with other levels of escapes just like escapes. Sometimes a system message is already a text message in a format that is known to not contain nulls, so using nulls this way won't have a problem. And sometimes you know the string is a valid Unicode string, and you could escape the nulls like the Java way of using an overly long presentation. The former is an indication of something is stringly typed, which is frown upon by some people. The later implies fixing the character encoding, and either trust or presanitization of user data, which are not always desirable. They may seem strange for these reasons, but if your language is designed to work in the specific situation, you really didn't break anything.

Programmers would want a raw format only because it isn't sure what the other layers would be. But if another layer is already specified in the language, and already had some rules of what is valid and what is not, having a shared way of manipulating all "commands" may even improve something, for example the grapheme cluster case. In such cases, the null terminators could not be dismissed just because they are redundant when a clearly defined buffer size is a must. And buffer sizes could be dedicated for the maximum size instead of the current string length.

$\endgroup$
2
$\begingroup$

I commonly hear null-terminated strings referred to as C strings, and length-prefixed strings referred to as Pascal strings.

The advantages of null-terminated strings are that they save several bytes over length-prefixed strings. Instead of needing to store a length (at least two bytes), they can store a null character of just one byte.

The disadvantages of null-terminated strings are many. Indexing is O(n) instead of O(1) (though with Unicode, the codepoint indexing of length-prefixed strings is also O(n)). Any time the length is needed for ex. reallocation, it must be recomputed, which remains an O(n) operation. They cannot contain the null byte, as it is reserved. They are extremely unsafe. Length checking is expensive, so it is tempting for programs to not do it, which can easily cause buffer overflows - the standard gets function in C is the classic example of this.

$\endgroup$
5
  • 1
    $\begingroup$ From my experience, the term "Pascal strings" usually refers to strings that are prefixed by a single length byte, and whose length is thus limited to 255 bytes. As such, the overhead is the same as for a zero terminated string, but a byte greater than for a zero padded string. While zero-terminated strings don't have a hard length limit, efficiency goes downhill with length, so in many cases where code would use strings that would get longer than 255 bytes, neither approach to holding strings will be as good as tracking the length separately. $\endgroup$
    – supercat
    Commented Nov 2, 2023 at 20:59
  • $\begingroup$ It appears to be ambiguous. $\endgroup$
    – apropos
    Commented Nov 3, 2023 at 19:54
  • 1
    $\begingroup$ A length doesn't have to be at least 2 bytes. There could be a specialized short string with a 1 byte length as described above. $\endgroup$
    – qwr
    Commented Jul 19 at 19:50
  • $\begingroup$ "Instead of needing to store a length (at least two bytes)" This would benefit from clarification. It seems trivial to have one-byte length prefixes, or even variable-size length prefixes. $\endgroup$ Commented 2 days ago
  • $\begingroup$ "though with Unicode, the codepoint indexing of length-prefixed strings is also O(n)" This would also benefit from clarification. For example, Python has full unicode string support but O(1) indexing. $\endgroup$ Commented 2 days ago
2
$\begingroup$

If you use just one byte for the length you can not have strings longer than 256 characters. The disadvantage of storing the length of a string is, that you have to make a decision about the length of the integer used to store the length of the string. If you terminate your string with a reserved byte, you do not have to make this decision. It will work equally no matter how large your storage will be.

X11 for example made such a decision. They decided that the integer used for the size of the window has 16 bit (CARD16). Even today you can not create bigger windows.

$\endgroup$
1
  • $\begingroup$ Welcome to PLDI! $\endgroup$ Commented Nov 8, 2023 at 17:53
2
$\begingroup$

Short String Optimization

One way one could save having to allocate a separate buffer at a separate location is short string optimization. The idea is if the string is shorter than the size of a pointer, the data can be stored in the space of the pointer. However, this requires knowing the size of the string beforehand. An example of how this could be implemented in C would be:

typedef struct String {
    size_t len;
    union {
        char *ptr, data[sizeof(char *)];
    };
} String;

And if len < sizeof(char *), access data directly, otherwise, access ptr.

A language could have string optimizations like this built-in, but of course, this is only possible if the length is known before-hand, which is not possible with null-terminated strings.

$\endgroup$
2
$\begingroup$

Real-world usage of null-terminated strings

The only reason we still care about null-terminated strings is because of the widespread usage of C strings and all the APIs built for them. It's simple to write some basic C or assembly functions that process a string byte by byte and stop at a null byte:

while (*s) {
    putchar(*s);
    ++s;
}

Example compiled assembly from GCC -Os, with some function boilerplate omitted since I wrote it as a function:

                                   # s starts in rdi 
    mov     rbx, rdi               # 
L:  movsx   edi, BYTE PTR [rbx]    # while (*s)
    test    dil, dil               # 
    je      done                   # { 
    call    putchar                #   putchar(*s)
    inc     rbx                    #   ++s
    jmp     L                      # }

...but not that much simpler than comparing s to s+len (GCC outsmarted me and saved an instruction in the loop, because I just counted down len):

char* end = s + len;
while (s != end) {
    putchar(*s);
    ++s;
}
                                # s starts in rdi, len starts in rsi
    mov     rbx, rdi            #
    lea     rbp, [rdi+rsi]      # end = s + len
L:  cmp     rbx, rbp            # while (s != end) 
    je      done                # {
    movsx   edi, BYTE PTR [rbx] #   arg1 = *s   
    inc     rbx                 #   ++s  (gcc does this here for some reason)
    call    putchar             #   putchar(arg1)
    jmp     L                   # }

Vulnerabilities

The massive downside is that a null-terminated string requires the string to have a null byte at the end AND not earlier. The vulnerabilities are numerous and very costly. Can you spot the problem in each example?

#define MAXLEN 1024
...
char *pathbuf[MAXLEN];
...
read(cfgfile,inputbuf,MAXLEN); 
strcpy(pathbuf,inputbuf);
...
char *foo;
int counter;
foo=calloc(sizeof(char)*10);

for (counter=0;counter!=10;counter++) {
foo[counter]='a';

printf("%s\n",foo);
}
char *foo;
foo=malloc(sizeof(char)*5);
foo[0]='a';
foo[1]='a';
foo[2]=0;  // example doesn't compile here so I simplified it
foo[3]='c';
foo[4]='\0';  // example has missing ; here
printf("%c %c %c %c %c \n",foo[0],foo[1],foo[2],foo[3],foo[4]);
printf("%s\n",foo);
char firstname[20];
char lastname[20];
char fullname[40];

fullname[0] = '\0';

strncat(fullname, firstname, 20);
strncat(fullname, lastname, 20);

The strn* functions aren't always safe. strncat(dest, src, count) always puts a null character at the end, so can write up to count+1 bytes! But strncpy(dest, src, count) won't null-terminate the result if count is reached before the entire src is copied.

Length-based strings

All mainstream languages since C that have a string object have used length-based strings because C strings are so painful and error-prone. A length-based string is easy to implement too (except assembly that doesn't have structs). The most basic example is something like

struct string {
  size_t length;
  char *text;
};

A size_t len might take 4 or 8 bytes instead of 1 null-terminating byte, but many non-trivial string tasks require keeping track of the length of the string or buffer separately anyway, if you don't want to waste time re-computing it with strlen every time. Plus if those extra bytes really matter, you could do an initial length byte limited to 255 and support "short strings" with the same space overhead of a single byte, or a 2 byte length that can support a very reasonable 65535 characters at the cost of only 1 extra byte. From the Wikipedia article:

FreeBSD developer Poul-Henning Kamp, writing in ACM Queue, referred to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.

A dynamically-sized string or array could also track a size_t capacity for re-allocating when growing or shrinking. SDS strings are one implementation that combine a C-string with a prefix before the actual pointer and a null terminator at the end.

$\endgroup$
1
$\begingroup$

Zero-terminated strings have two main advantages:

  1. It's easy to design code that can process zero-terminated strings and zero-padded strings in known-length buffers interchangeably.

  2. Approaches involving prefixed strings involve engineering trade-offs; although in many cases, zero-terminated strings will be worse than even less-than-ideal prefixed forms, data interchange between prefixed string implementations which make different trade-offs will be awkward.

For many purposes, it would be useful to have a format where the byte pointed to by a "string pointer" will be one of the following, using different ranges of values to represent them:

  1. Part of the length of a length-prefixed string which fills the allocated space.

  2. Part of the length of a length-prefixed string buffer which is presently empty.

  3. Part of the length of a length-prefixed string buffer which is partially full, and whose last byte value X either indicates--if less than 128--that it is (X+1) bytes away from being full, or indicates that the preceding 256-X bytes hold the number of empty bytes at the end of the buffer.

  4. A pointer to a "type marker" byte for a structure resembling one of the following:

     struct readableStringRef {
       unsigned char typeMarker,dummy;
       unsigned length;
       char *content;
     };
    

    or

     struct writableStringRef {
       unsigned char typeMarker,dummy;
       unsigned length;
       char *content;
       unsigned buffer_size;
       unsigned (*adjust_func)(void* string, unsigned op, 
                               unsigned param, void *param2);
     };
    

Code which receives a pointer to a string in any of the above numbered formats could call one of two functions to convert it into one of the two structures in (4), based upon whether the code would need to modify the string's length. Code which needs to e.g. append something to a string could then be agnostic to whether it was stored in a buffer, in malloc storage, storage managed by some other means, etc. and code which needs to pass a substring to a function that will only read it could construct a readableStringRef identifying that substring, whether or not it was the tail, without having to copy or modify the original string.

$\endgroup$

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .