It has come to pass that there is a de facto universal delimiter in ASCII: the null character. Unix and the C language showed that you can build an entire platform in which the null character is banished from character strings, serving instead as the terminator in their representation. Other platforms, such as Microsoft Windows, have followed suit.
Today, it's a virtually iron-clad guarantee that no textual datum contains a null byte. If a datum contains a null byte, it's binary and not text.
If you want to store a sequence of textual records or fields in a byte stream and you separate them with nulls, you will have next to no issues. Nulls don't require any nonsense like escaping. If someone comes along and says they want to include a null byte in a text field, you can laugh them off as a comedian.
Examples of null separation in the wild:
Microsoft allows items in the registry to be multi-strings: single items containing multiple strings. This is stored as a sequence of null-terminated strings catenated together, with an extra null byte to terminate the whole sequence, as in "the\0quick\0brown\0fox\0\0" to represent the list of strings "the", "quick", "brown", "fox".
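The multi-string layout described above can be sketched in a few lines of Python. This is only an illustration of the framing: the function names are made up for this example, and it works on ordinary Python strings rather than on whatever character encoding the registry actually uses.

```python
def encode_multi_string(strings):
    """Join items as null-terminated strings plus one extra final null."""
    for s in strings:
        if "\0" in s:
            raise ValueError("items must not contain a null character")
        if s == "":
            raise ValueError("an empty item would look like the end marker")
    return "".join(s + "\0" for s in strings) + "\0"


def decode_multi_string(data):
    """Collect null-terminated items until the empty item that ends the list."""
    items = []
    for item in data.split("\0"):
        if item == "":
            break  # the extra null: end of the whole sequence
        items.append(item)
    return items


encoded = encode_multi_string(["the", "quick", "brown", "fox"])
print(repr(encoded))
print(decode_multi_string(encoded))
```

One consequence of the double-null end marker is visible in the encoder: an empty string cannot be an item, since it would be indistinguishable from the end of the list.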
On Linux, the environment variables of each process are available via the /proc filesystem as /proc/<pid>/environ. This virtual file uses null separation, like PATH=/bin:/usr/bin\0TERM=xterm\0...
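As a rough sketch of how such a block can be consumed, the Python below splits a null-separated byte string into a dictionary. The sample bytes are the two entries quoted above; parse_environ is a hypothetical helper name, and decoding with os.fsdecode is an assumption about how you want to treat non-UTF-8 values.

```python
import os


def parse_environ(block):
    """Split a null-separated NAME=value byte block into a dict."""
    env = {}
    for entry in block.split(b"\0"):
        if not entry:
            continue  # skip the empty piece that follows the final null
        # Only the first '=' separates name from value; later ones are data.
        name, _, value = entry.partition(b"=")
        env[os.fsdecode(name)] = os.fsdecode(value)
    return env


sample = b"PATH=/bin:/usr/bin\0TERM=xterm\0"
print(parse_environ(sample))
```

On a Linux system the same function should work on the bytes read from /proc/self/environ opened in binary mode.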
Some GNU utilities have the option to produce null-separated output, and that is precisely what allows them to be used to write much more robust scripts. GNU find has a -print0 predicate for printing paths with null termination instead of newline separation. These paths can be fed to xargs -0, which reads null-separated strings from its standard input and turns them into command line arguments for a specified command. This combo will cleanly pass absolutely all file names and paths regardless of what they contain, because paths cannot contain a null byte.
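To make the robustness argument concrete, here is a small Python sketch that frames a list of awkward file names with null terminators and reads them back, then shows what newline separation does to the same list. It imitates only the framing used by find -print0 and xargs -0, not the tools themselves; the helper names and file names are invented for the example.

```python
def write_null_terminated(paths):
    """Serialize paths with a null byte after each one."""
    return b"".join(p + b"\0" for p in paths)


def read_null_terminated(data):
    """Recover the paths from a null-terminated byte stream."""
    pieces = data.split(b"\0")
    if pieces and pieces[-1] == b"":
        pieces.pop()  # drop the empty piece after the last terminator
    return pieces


paths = [b"plain.txt", b"with space.txt", b"new\nline.txt", b"tab\there.txt"]

# Null framing round-trips every name unchanged.
round_trip = read_null_terminated(write_null_terminated(paths))
print(round_trip == paths)

# Newline framing breaks the name that itself contains a newline.
newline_pieces = b"\n".join(paths).split(b"\n")
print(len(paths), "names became", len(newline_pieces), "pieces")
```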
Why do we play games with other separation? Tabs, commas, semicolons and whatnot, rather than just using null? The problem is that we need multiple levels of separation. Okay, so nulls chop the byte stream into texts, reliably. But within those texts, there may be another level of delimitation needed. It sometimes happens that a single string has more structure inside it. A path contains slashes to separate components. A MAC address uses colons to separate bytes. That sort of thing. An e-mail address has multiple levels of nested delimitation: local@domain is split around the @ symbol, and the domain part is then separated with dots. Parentheses are allowed in there, as are things like % and !. People write string-handling code to deal with these formats, and that string-handling code will not like null bytes in a lot of languages, due to the influence of C and Unix.
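The point about string-handling code rejecting null bytes can be seen even in a language whose own strings allow them. The sketch below assumes CPython, where, as far as I know, open() refuses a file name with an embedded null before it ever reaches the operating system; try_open is a hypothetical helper and the file name is made up.

```python
def try_open(name):
    """Return the exception raised when opening name, or None if it opened."""
    try:
        with open(name):
            return None
    except (ValueError, OSError) as exc:
        return exc


# The string itself is legal in Python: the null is just another character.
name = "no-such-dir/a\0b.txt"
print(len(name))

# Handing it to an interface layered over C strings is another matter.
err = try_open(name)
print(type(err).__name__)
```

The language is happy to hold the null inside a string; it is the boundary with the C-style file API that turns it away.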
Demo of GNU Awk using the null byte as the field separator, processing /proc/self/environ:
$ awk -F'\0' \
    '{ for (i = 1; i <= NF; i++)
         printf("field[%d] = %s\n", i, $i) }' \
    /proc/self/environ
field[1] = CLUTTER_IM_MODULE=xim
field[2] = XDG_MENU_PREFIX=gnome-
field[3] = LANG=en_CA.UTF-8
field[4] = DISPLAY=:0
field[5] = OLDPWD=/home/kaz/tftproot
field[6] = GNOME_SHELL_SESSION_MODE=ubuntu
field[7] = EDITOR=vim
[ snip ... ]
field[54] = PATH=/home/kaz/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/kaz/bin:/home/kaz/bin
field[55] = GJS_DEBUG_TOPICS=JS ERROR;JS LOG
field[56] = SESSION_MANAGER=local/sun-go:@/tmp/.ICE-unix/1986,unix/sun-go:/tmp/.ICE-unix/1986
field[57] = GTK_IM_MODULE=ibus
field[58] = _=/usr/bin/awk
field[59] =
We get an extra blank field due to the null byte at the end, because Awk is treating it as a field separator rather than a terminator. However, this is possible precisely because GNU Awk allows for the null byte to be a constituent of character strings. The argument -F'\0' is not required to work, according to the POSIX specification. POSIX says, in a table entitled "Escape Sequences in awk", that
\ddd: A <backslash> character followed by the longest sequence of one, two, or three octal-digit characters (01234567). If all of the digits are 0 (that is, representation of the NUL character), the behavior is undefined.
Thus it is entirely nonportable to rely on Awk to separate fields or records on the null byte. This kind of language problem is probably one reason we don't make more use of null characters.
URLs use # to separate an anchor, ? to separate a query (which again typically has & separating variable assignments, with name and value separated by =), then / to separate parts that often correspond to folders, the hostname sprinkled with . to delimit the structural parts of a domain name, sometimes an @ to delimit auth info for that hostname, and : to delimit the username and password of the auth info.
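For illustration, those nested delimiters can be peeled apart with Python's standard urllib.parse module. The URL and the dissect helper below are made up for this example; the sketch only shows which delimiter governs which layer.

```python
from urllib.parse import urlsplit, parse_qsl


def dissect(url):
    """Break a URL into the layers governed by each delimiter."""
    parts = urlsplit(url)
    return {
        "anchor": parts.fragment,                # after '#'
        "query": parse_qsl(parts.query),         # after '?', split on '&' and '='
        "path": parts.path.split("/"),           # split on '/'
        "host_labels": parts.hostname.split("."),  # split on '.'
        "username": parts.username,              # before ':' in the part before '@'
        "password": parts.password,              # after ':' in the part before '@'
    }


example = "https://alice:secret@www.example.com/docs/guide/index.html?lang=en&page=2#intro"
for layer, value in dissect(example).items():
    print(layer, "=", value)
```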