What are the advantages/disadvantages of null-terminated strings vs length-prefixed strings?

Question

There are two main ways to implement strings in a language: null-terminated and length-prefixed strings.

From Wikipedia:

A null-terminated string is a character string stored as an array containing the characters and terminated with a null character, often this is the null character (NUL), which has all bits zero.

and:

Length-prefixed

The length of a string can also be stored explicitly, for example by prefixing the string with the length as a byte value.

For example, the string "Hello, World!", when using null-terminated strings (using ASCII), is stored as:

48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21 00
                                       ^^
                                       the NUL character

and using length-prefixed string:

0D 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21
^^
the length byte

These approach is vastly different, so what are the advantages of advantages/disadvantages of using one instead of another?

how is this really a language-related question as opposed to a computer science question? Because you mention "implementation" in the question but not "design", so I'm not sure as someone considering answering this question if you care about language design questions like usability- aside from which I'm not really sure if there's more to say than is just about the computer-science face of the question. — starball, Commented May 21, 2023 at 22:06
See also: stackoverflow.com/questions/25068903/what-are-pascal-strings — oɔɯǝɹ, Commented Aug 15, 2023 at 21:36
A common trick for ASCII strings (pre-unicode) was to use the 8th bit in the byte to mark the end of the string -- the last character would have it set (or clear) and all the others would have it clear (or set). — Chris Dodd, Commented Oct 28, 2023 at 9:38
You might be entertained by my very first blog post from 2003 which was on this subject. ericlippert.com/2003/09/12/… — Eric Lippert, Commented Jul 17 at 18:06

kaya3 · Accepted Answer · 2023-05-17 01:45:35Z

13

An option missing from the question is fat pointers ─ the type &str in Rust is an example of this. The length is not stored on the heap as a prefix to the string data, instead it is stored alongside the pointer, so that a reference to a string takes two words (length and pointer) instead of just one for a pointer.

This means that if there are multiple references to the same string, then the length data is duplicated compared to a length-prefixed string, which would only store the length once, where the string data is. But the upside is that a fat pointer can reference a substring without duplicating the string data on the heap.

In the diagram above (from the official Rust book), s is a String so it has a fat pointer to the whole string allocation (plus a capacity field, since it's a growable string), while world is a shared reference (i.e. a fat pointer) to a substring. This sharing would not be possible with length-prefixing, and would be possible with null-termination for substrings at the end of the string but not otherwise.

answered May 17, 2023 at 1:45

kaya3

20.1k40 silver badges120 bronze badges

6

$\begingroup$ Fat pointers are orthogonal to the question and are just a way to implement the length-based variant. $\endgroup$
– feldentm
Commented Jul 17, 2023 at 18:12
2

$\begingroup$ C++ strings (and string_views) follow this model, too (in most implementations). $\endgroup$
– Toby Speight
Commented Jul 18, 2023 at 8:27
$\begingroup$ This schema has a drawback when you want to modify the contents of the string, since you cannot change (nor free) those bytes. So you would have to use immutable strings and implement some kind of garbage collection... $\endgroup$
– Ángel
Commented Jul 20 at 1:32
$\begingroup$ Immutable strings can have Copy on Write $\endgroup$
– qwr
Commented yesterday
$\begingroup$ @qwr: In the absence of a tracing collector, code which wants to replace a reference to a possible-shared immutable string with a reference to a different string would need to do an synchronized decrement of the reference count and ensure that eliminating the last reference to a string would either free the original string or (if it had the desired length) recycle it into a non-sharable mutable string instead of creating a new object. $\endgroup$
– supercat
Commented yesterday

Add a comment |

Isaiah · Accepted Answer · 2023-05-17 00:59:34Z

12

Length-prefixed strings have the advantage of being able to find their length in O(1) time rather than O(n) time. This means you can find the end of the string more easily with the length prefix. They are also less error prone to use since you don't have to deal with forgetting to null terminate a string.

One disadvantage to length prefixed strings is that they require more space. In addition, you are limited in what the max size of the string can be based on how many bytes are used to store the length.

edited May 17, 2023 at 0:59

answered May 17, 2023 at 0:49

Isaiah

7891 gold badge6 silver badges27 bronze badges

6

$\begingroup$ Strings which are prefixed by a variable-length prefix would require the same amount of space as zero-terminated strings when they're short enough to encode the length in a single byte. One could encapsulate both buffer size and current length with a single byte of overhead [same as zero-termination] for strings or buffers up to 63 bytes, two bytes for strings up to 4095 bytes, and five bytes for strings up to 2GiB, while leaving a few prefix byte values available for other purposes. $\endgroup$
– supercat
Commented Jul 19, 2023 at 17:40
$\begingroup$ @supercat another way to do it is the ASN.1/DER scheme. $\endgroup$
– Karl Knechtel
Commented Nov 12, 2023 at 5:54
1

$\begingroup$ @KarlKnechtel: The scheme I was describing would allow the length of a string to be dynamically changed to any value less than the length of the buffer. Effectively, one bit in the prefix says whether the buffer is full or not, and if the buffer isn't full then either there will a byte at the end to record that it's one byte short of full, two bytes to report that it's 1-255 bytes short of being full, or lots of bytes to report any amount of space short of being full. $\endgroup$
– supercat
Commented Nov 13, 2023 at 15:53
$\begingroup$ Yes, I understood that. I'm only saying that there are pre-existing schemes that do the same thing and might be considered more standardized. $\endgroup$
– Karl Knechtel
Commented Nov 13, 2023 at 20:57
$\begingroup$ @KarlKnechtel: I am unaware of any standard for any method that encapsulates both the size of the a buffer and its present length using only one more bit (or limiting the length range by a factor of two) compared with storing the size alone. $\endgroup$
– supercat
Commented yesterday

Add a comment |

pppery · Accepted Answer · 2023-05-17 01:30:28Z

12

Null-terminated strings can't include the null byte, whereas length-prefixed ones can include all 256 possible binary values.

answered May 17, 2023 at 1:30

pppery

9305 silver badges18 bronze badges

2

$\begingroup$ We're talking character strings here, not binary strings, so there's more than 256 possible values. Java, when exporting strings to C, will translate the null byte to C0 80 which is technically invalid in UTF-8 but which would decode to the null byte. $\endgroup$
– prosfilaes
Commented May 21, 2023 at 23:01
1

$\begingroup$ There's always modified UTF-8 but that way lies madness. $\endgroup$
– Mike Samuel
Commented Jul 17, 2023 at 21:23
1

$\begingroup$ @prosfilaes: Many programs exist for the purpose of processing text which will be primarily processed by other programs, rather than humans. As such, there's no need to worry about variable-length code points and other gratuitous complications. $\endgroup$
– supercat
Commented Jul 18, 2023 at 21:46
3

$\begingroup$ @prosfilaes: More generally, a language that can work with arbitrary binary blobs will be able to, with proper application code, work with all present and future methods of encoding information as a sequence of octets By contrast, a language which tries to add special semantics to certain byte sequences will be unable to work with information encoded in ways for which such treatment would be inappropriate. $\endgroup$
– supercat
Commented Jul 19, 2023 at 17:00
1

$\begingroup$ @prosfilaes: Embedded devices vastly outnumber PCs and phones. With the exception of code whose purpose is to perform Unicode operations as a sequence of byte manipulations, most tasks that would involve Unicode text handling and all the corner cases associated therewith could be better accomplished in languages other than C. $\endgroup$
– supercat
Commented Jul 19, 2023 at 17:47

| Show 11 more comments

Bbrk24 · Accepted Answer · 2023-05-17 00:51:16Z

10

Length-prefixed can include capacity

This lets you "over-allocate" storage and remember how much you did so:

0D 10 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21 00 00 00
^  ^  ^                                      ^
|  |  data...                                over-allocation
|  capacity
length

This is particularly useful for languages with mutable strings, as it allows append operations to be amortized O(1).

answered May 17, 2023 at 0:51

Bbrk24

9,1371 gold badge16 silver badges66 bronze badges

2

$\begingroup$ If you're just going to say "add a prefixing capacity" field, doesn't that apply equally well for null-terminated strings? who says you can't have a capacity prefix for a null-terminated string? $\endgroup$
– starball
Commented May 21, 2023 at 21:50
$\begingroup$ I suppose it's possible, but I don't know of any widespread implementations that prefix capacity but not length. $\endgroup$
– Bbrk24
Commented May 21, 2023 at 21:53
3

$\begingroup$ @starball You cannot do append operations in amortized O(1) on null-teminated strings with a capacity prefix because you still need to do find the end of the string, which is O(n). $\endgroup$
– wastl
Commented Jul 18, 2023 at 5:20
$\begingroup$ @wastl: One could do so if one specified that the last byte of the allocation must either be a zero if the string length is the allocation length minus one, a value 1-127 to indicate that there are 1-127 unused bytes following the null terminator, or a value 256-N to indicate that the preceding N bytes of storage hold the length of the unused portion of the storage, in little-endian format. $\endgroup$
– supercat
Commented Jul 18, 2023 at 21:49

Add a comment |

CPlus · Accepted Answer · 2024-07-16 21:30:17Z

8

Length-based strings have two non-obvious performance advantages:

Concatenating lots of them into a big chunk is faster as you can preallocate the correct buffer size by just looking at the lengths.

Furthermore, comparing lots of strings with different lengths but common prefixes is faster. An example of this is a set of fully qualified names like java.util.Set, java.util.HashSet, java.util.Collection, etc.

The non-obvious advantage of zero-terminated strings is that one doesn't have to think up front about the length of an index type, i.e. setting the length to e.g. 32bit won't create a compatibility issue someday.

edited Jul 16 at 21:30

CPlus

8,8732 gold badges31 silver badges89 bronze badges

answered Jul 17, 2023 at 18:17

feldentm

1,9853 silver badges17 bronze badges

3

$\begingroup$ On the flip side, zero-terminated strings will have a performance cost which will increase sharply as their length becomes significant relative to the CPU cache size--something that would happen long before a 32-bit length becomes a problem. $\endgroup$
– supercat
Commented Aug 17, 2023 at 17:26

Add a comment |

Seggan · Accepted Answer · 2023-11-06 20:11:36Z

6

Manipulation

Manipulating length-prefixed strings is a lot easier than manipulating null-terminated ones. For example, when you concatenate strings, you just have to add the lengths, while for the null-terminated case you have the extra null to remove. When splitting strings, you have to not forget to include the null. For length-prefixed strings, you just take the length of the array and halve it.

edited Nov 6, 2023 at 20:11

answered May 17, 2023 at 0:52

Seggan

2,7539 silver badges41 bronze badges

$\begingroup$ okay, but when splitting strings, who says you can do it without moving data? If you give more space to fit the length prefix than for each character, then don't you have to move data? Then it's not just a usability cost- it's a performance cost... $\endgroup$
– starball
Commented May 21, 2023 at 21:49
$\begingroup$ @starball i never said you can do it without moving data. $\endgroup$
– Seggan
Commented Nov 6, 2023 at 20:11
$\begingroup$ @starball: If code which expects to work with a string first calls a function to convert a string pointer into text address and length, and if special prefix values are reserved for data records that, instead of holding a string directly, hold the address and length of the actual string content, then one could pass arbitrary substrings to code which expects to receive a single "string pointer", without having to copy any string data (construct a record containing a header, the address of the start of the desired substring, and the length of the desired substrign). $\endgroup$
– supercat
Commented yesterday
$\begingroup$ @supercat sure. that's my understanding of what std::stringview is. but in the context o this question, I think it's debatable whether that counts as being the string itself. std::stringview, for example, is a non-reference-counted, non-owning view type. $\endgroup$
– starball
Commented yesterday
$\begingroup$ @starball: Can code easily accept pointers to std::string and std::stringview interchangeably? What I'd advocate would essentially be having a constructors for "readable string descriptor" and "mutable string descriptor" structures which could accept pointers to descriptors or minimal-storage-overhead in-place strings interchangeably. $\endgroup$
– supercat
Commented yesterday

Add a comment |

user23013 · Accepted Answer · 2023-08-15 21:15:58Z

Null-terminated strings seem to have many disadvantages, but the biggest problem is you usually have to also store a size somewhere else, to avoid vulnerabilities, making it seemingly redundant.

But there is an unexpected advantage to also have the null terminator, that is in simple routines such as scanning a string starting from a given point for a complete grapheme cluster, you'll only need to pass one pointer, without the start of the string or the size.

Similar situations are common in any case you use characters like "commands", say escapes in quotes and regexes, and printf and datetime formats. If you have a unified format for all the escapes, it won't have any problems. But if something has to be escaped in multiple layers, or the same thing has different meaning in different layers, the code quickly becomes ugly, and someone will want to choose a raw format in all but one layer instead. Strings and regexes usually doesn't have escapes with conflicting meaning, but the designer may not think making exceptions for regexes a good idea, and may not come up with a general interface for adding escapes or making escaped quotes and the escape characters optionally kept in the escaped format. But they could be improved if the designer does.

Null-terminated strings are a bit worse than escapes, but if we are not considering the buffer size part, they mix with other levels of escapes just like escapes. Sometimes a system message is already a text message in a format that is known to not contain nulls, so using nulls this way won't have a problem. And sometimes you know the string is a valid Unicode string, and you could escape the nulls like the Java way of using an overly long presentation. The former is an indication of something is stringly typed, which is frown upon by some people. The later implies fixing the character encoding, and either trust or presanitization of user data, which are not always desirable. They may seem strange for these reasons, but if your language is designed to work in the specific situation, you really didn't break anything.

Programmers would want a raw format only because it isn't sure what the other layers would be. But if another layer is already specified in the language, and already had some rules of what is valid and what is not, having a shared way of manipulating all "commands" may even improve something, for example the grapheme cluster case. In such cases, the null terminators could not be dismissed just because they are redundant when a clearly defined buffer size is a must. And buffer sizes could be dedicated for the maximum size instead of the current string length.

apropos · Accepted Answer · 2023-10-26 23:41:00Z

2

I commonly hear null-terminated strings referred to as C strings, and length-prefixed strings referred to as Pascal strings.

The advantages of null-terminated strings are that they save several bytes over length-prefixed strings. Instead of needing to store a length (at least two bytes), they can store a null character of just one byte.

The disadvantages of null-terminated strings are many. Indexing is O(n) instead of O(1) (though with Unicode, the codepoint indexing of length-prefixed strings is also O(n)). Any time the length is needed for ex. reallocation, it must be recomputed, which remains an O(n) operation. They cannot contain the null byte, as it is reserved. They are extremely unsafe. Length checking is expensive, so it is tempting for programs to not do it, which can easily cause buffer overflows - the standard gets function in C is the classic example of this.

answered Oct 26, 2023 at 23:41

apropos

1,0783 silver badges17 bronze badges

1

$\begingroup$ From my experience, the term "Pascal strings" usually refers to strings that are prefixed by a single length byte, and whose length is thus limited to 255 bytes. As such, the overhead is the same as for a zero terminated string, but a byte greater than for a zero padded string. While zero-terminated strings don't have a hard length limit, efficiency goes downhill with length, so in many cases where code would use strings that would get longer than 255 bytes, neither approach to holding strings will be as good as tracking the length separately. $\endgroup$
– supercat
Commented Nov 2, 2023 at 20:59
$\begingroup$ It appears to be ambiguous. $\endgroup$
– apropos
Commented Nov 3, 2023 at 19:54
1

$\begingroup$ A length doesn't have to be at least 2 bytes. There could be a specialized short string with a 1 byte length as described above. $\endgroup$
– qwr
Commented Jul 19 at 19:50
$\begingroup$ "Instead of needing to store a length (at least two bytes)" This would benefit from clarification. It seems trivial to have one-byte length prefixes, or even variable-size length prefixes. $\endgroup$
– MisterMiyagi
Commented 2 days ago
$\begingroup$ "though with Unicode, the codepoint indexing of length-prefixed strings is also O(n)" This would also benefit from clarification. For example, Python has full unicode string support but O(1) indexing. $\endgroup$
– MisterMiyagi
Commented 2 days ago

Add a comment |

ceving · Accepted Answer · 2023-11-08 17:51:51Z

2

If you use just one byte for the length you can not have strings longer than 256 characters. The disadvantage of storing the length of a string is, that you have to make a decision about the length of the integer used to store the length of the string. If you terminate your string with a reserved byte, you do not have to make this decision. It will work equally no matter how large your storage will be.

X11 for example made such a decision. They decided that the integer used for the size of the window has 16 bit (CARD16). Even today you can not create bigger windows.

answered Nov 8, 2023 at 17:51

ceving

3761 silver badge9 bronze badges

$\begingroup$ Welcome to PLDI! $\endgroup$
– Rydwolf Programs
Commented Nov 8, 2023 at 17:53

Add a comment |

CPlus · Accepted Answer · 2023-11-08 18:03:17Z

Short String Optimization

One way one could save having to allocate a separate buffer at a separate location is short string optimization. The idea is if the string is shorter than the size of a pointer, the data can be stored in the space of the pointer. However, this requires knowing the size of the string beforehand. An example of how this could be implemented in C would be:

typedef struct String {
    size_t len;
    union {
        char *ptr, data[sizeof(char *)];
    };
} String;

And if len < sizeof(char *), access data directly, otherwise, access ptr.

A language could have string optimizations like this built-in, but of course, this is only possible if the length is known before-hand, which is not possible with null-terminated strings.

qwr · Accepted Answer · 2024-07-21 00:05:46Z

Real-world usage of null-terminated strings

The only reason we still care about null-terminated strings is because of the widespread usage of C strings and all the APIs built for them. It's simple to write some basic C or assembly functions that process a string byte by byte and stop at a null byte:

while (*s) {
    putchar(*s);
    ++s;
}

Example compiled assembly from GCC -Os, with some function boilerplate omitted since I wrote it as a function:

                                   # s starts in rdi 
    mov     rbx, rdi               # 
L:  movsx   edi, BYTE PTR [rbx]    # while (*s)
    test    dil, dil               # 
    je      done                   # { 
    call    putchar                #   putchar(*s)
    inc     rbx                    #   ++s
    jmp     L                      # }

...but not that much simpler than comparing s to s+len (GCC outsmarted me and saved an instruction in the loop, because I just counted down len):

char* end = s + len;
while (s != end) {
    putchar(*s);
    ++s;
}

                                # s starts in rdi, len starts in rsi
    mov     rbx, rdi            #
    lea     rbp, [rdi+rsi]      # end = s + len
L:  cmp     rbx, rbp            # while (s != end) 
    je      done                # {
    movsx   edi, BYTE PTR [rbx] #   arg1 = *s   
    inc     rbx                 #   ++s  (gcc does this here for some reason)
    call    putchar             #   putchar(arg1)
    jmp     L                   # }

Vulnerabilities

The massive downside is that a null-terminated string requires the string to have a null byte at the end AND not earlier. The vulnerabilities are numerous and very costly. Can you spot the problem in each example?

CWE-170: Improper Null Termination

#define MAXLEN 1024
...
char *pathbuf[MAXLEN];
...
read(cfgfile,inputbuf,MAXLEN); 
strcpy(pathbuf,inputbuf);
...

CWE-463: Deletion of Data Structure Sentinel (sentinel being the null-terminator)

char *foo;
int counter;
foo=calloc(sizeof(char)*10);

for (counter=0;counter!=10;counter++) {
foo[counter]='a';

printf("%s\n",foo);
}

CWE-464: Addition of Data Structure Sentinel (Null byte injection)

char *foo;
foo=malloc(sizeof(char)*5);
foo[0]='a';
foo[1]='a';
foo[2]=0;  // example doesn't compile here so I simplified it
foo[3]='c';
foo[4]='\0';  // example has missing ; here
printf("%c %c %c %c %c \n",foo[0],foo[1],foo[2],foo[3],foo[4]);
printf("%s\n",foo);

CWE-193: Off-by-one Error especially common in string length calculations. This kind of thing is really easy to overlook:

char firstname[20];
char lastname[20];
char fullname[40];

fullname[0] = '\0';

strncat(fullname, firstname, 20);
strncat(fullname, lastname, 20);

The strn* functions aren't always safe. strncat(dest, src, count) always puts a null character at the end, so can write up to count+1 bytes! But strncpy(dest, src, count) won't null-terminate the result if count is reached before the entire src is copied.

Length-based strings

All mainstream languages since C that have a string object have used length-based strings because C strings are so painful and error-prone. A length-based string is easy to implement too (except assembly that doesn't have structs). The most basic example is something like

struct string {
  size_t length;
  char *text;
};

A size_t len might take 4 or 8 bytes instead of 1 null-terminating byte, but many non-trivial string tasks require keeping track of the length of the string or buffer separately anyway, if you don't want to waste time re-computing it with strlen every time. Plus if those extra bytes really matter, you could do an initial length byte limited to 255 and support "short strings" with the same space overhead of a single byte, or a 2 byte length that can support a very reasonable 65535 characters at the cost of only 1 extra byte. From the Wikipedia article:

FreeBSD developer Poul-Henning Kamp, writing in ACM Queue, referred to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.

A dynamically-sized string or array could also track a size_t capacity for re-allocating when growing or shrinking. SDS strings are one implementation that combine a C-string with a prefix before the actual pointer and a null terminator at the end.

Toby Speight · Accepted Answer · 2023-07-18 11:41:16Z

Zero-terminated strings have two main advantages:

It's easy to design code that can process zero-terminated strings and zero-padded strings in known-length buffers interchangeably.
Approaches involving prefixed strings involve engineering trade-offs; although in many cases, zero-terminated strings will be worse than even less-than-ideal prefixed forms, data interchange between prefixed string implementations which make different trade-offs will be awkward.

For many purposes, it would be useful to have a format where the byte pointed to by a "string pointer" will be one of the following, using different ranges of values to represent them:

Part of the length of a length-prefixed string which fills the allocated space.
Part of the length of a length-prefixed string buffer which is presently empty.
Part of the length of a length-prefixed string buffer which is partially full, and whose last byte value X either indicates--if less than 128--that it is (X+1) bytes away from being full, or indicates that the preceding 256-X bytes hold the number of empty bytes at the end of the buffer.

A pointer to a "type marker" byte for a structure resembling one of the following:

 struct readableStringRef {
   unsigned char typeMarker,dummy;
   unsigned length;
   char *content;
 };

or

 struct writableStringRef {
   unsigned char typeMarker,dummy;
   unsigned length;
   char *content;
   unsigned buffer_size;
   unsigned (*adjust_func)(void* string, unsigned op, 
                           unsigned param, void *param2);
 };

Code which receives a pointer to a string in any of the above numbered formats could call one of two functions to convert it into one of the two structures in (4), based upon whether the code would need to modify the string's length. Code which needs to e.g. append something to a string could then be agnostic to whether it was stored in a buffer, in malloc storage, storage managed by some other means, etc. and code which needs to pass a substring to a function that will only read it could construct a readableStringRef identifying that substring, whether or not it was the tail, without having to copy or modify the original string.

Stack Exchange Network

What are the advantages/disadvantages of null-terminated strings vs length-prefixed strings?

12 Answers 12

Length-prefixed can include capacity

Manipulation

Short String Optimization

Real-world usage of null-terminated strings

Vulnerabilities

Length-based strings

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged
implementation
strings
.

Hot Network Questions

What are the advantages/disadvantages of null-terminated strings vs length-prefixed strings?

12 Answers 12

Length-prefixed can include capacity

Manipulation

Short String Optimization

Real-world usage of null-terminated strings

Vulnerabilities

Length-based strings

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged implementationstrings.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
implementation
strings
.