32

My question is: why is there no specific "delimiter" character, one that would be used for all types of delimiting? We have special characters for new lines, print settings, etc.

Why do we sometimes use commas, spaces, tabs, etc., if those are common text characters? Is there history behind this? Maybe they didn't need delimiter characters when ASCII or the like was made?

(What would seem to make sense to me: have a special delimiter character whose sole purpose is to "delimit" separate values when needed.)

6
  • 30
    xkcd.com/927 pretty much sums it up
    – Mokubai
    Commented Apr 28, 2021 at 0:08
  • 62
    According to Wikipedia, ASCII has four different separators (28-31) for files, groups, records and units, respectively. Commented Apr 28, 2021 at 1:30
  • 4
    So many reasons... see @Mokubai's comment. :) Of course they needed delimiters even way back in the beginning. One of the primary reasons is flexibility. There is no telling what my data might contain. For instance, what if a comma were mandated as "the" character (as it usually is for CSV, of course), but my data itself were riddled with them? Sure, I could escape every one of them, or instead it could be flexible enough to let me choose on my own. Also, file systems weren't invented by the same people that made databases, or anything else that needed to be delimited. Commented Apr 28, 2021 at 2:58
  • 6
    Just look at something in everyday use such as a URL, which may contain lots of different delimiter characters, but at different "levels", so that using a single delimiter would lead to confusion: # to separate an anchor, ? to separate a query (which in turn typically has & separating variable assignments, with name and value separated by =), then / to separate parts that often correspond to folders, the hostname sprinkled with . to delimit the structural parts of a domain name, sometimes an @ to delimit auth info for that hostname, and : to delimit the username and password of that auth info. Commented Apr 28, 2021 at 21:26
  • 5
    @farrenthorpe I'm a person, not a delimiter!
    – pipe
    Commented Apr 30, 2021 at 12:04

6 Answers

82

Delimiters already exist in ASCII. Decimal 28-31 (hex 1C-1F) are delimiters. This includes the file, record, group and unit separators.

I would assume we do not use them because it is easier to type printable keyboard characters that do not require multiple keys per character. It also allows for easier interchange between different formats: comma-separated values will work on virtually any system, ASCII-compliant or not.
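That said, the ASCII separators still work fine today. Here is a small sketch (using only POSIX printf and awk, with octal escapes, since \x escapes are not portable) of splitting a stream on the record separator (036) and unit separator (037):

```shell
# Sketch: ASCII record separator (RS, octal 036) between records,
# unit separator (US, octal 037) between fields within each record.
printf 'alice\03730\036bob\03725\036' |
awk 'BEGIN { RS = "\036"; FS = "\037" } NF { print $1 " is " $2 }'
# prints: alice is 30
#         bob is 25
```

The names and values are made up for illustration; the point is that no escaping logic is needed, because the separators cannot appear in ordinary text.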

12
  • 48
    Comma separated values don't work well in Europe, where we have decimal commas en.wikipedia.org/wiki/… The Microsoft use of tab as a field separator on the clipboard works quite well, but is not standard in files.
    – grahamj42
    Commented Apr 28, 2021 at 10:50
  • 17
    ASCII is from the 60's, when teletypes were all the rage (which is why 7 is BEL). At that time I think most if not all operating systems expected files to have records and a schema, and to work like accessing rows in databases. UNIX comes around in the 70's with its simplified "worse is better" philosophy and basically introduces the notion that everything should be human-readable text where possible, files and I/O are just streams of bytes, and individual programs are responsible for file structure. So I think at that point the idea of special delimiter codes fell out of fashion.
    – LawrenceC
    Commented Apr 28, 2021 at 17:51
  • 10
    Note also that most text editors do not display any of these characters (or in some cases only display them optionally), so while that's great for computers to read, it's not so nice for other humans. Commented Apr 28, 2021 at 20:43
  • 17
    @grahamj42 One response to "don't work well in Europe" is to expand the 'A' in "ASCII" :-)
    – Kapil
    Commented Apr 29, 2021 at 2:52
  • 9
    I have actually written programs back in the day that used the ASCII separators, which as I recall were originally intended for MagTape formatting. Problem is, they didn’t work very well with binary (a problem shared by C’s null-terminated strings), so they fell out of use pretty quickly. Commented Apr 29, 2021 at 15:11
52

As already noted, ASCII includes delimiters. The problem is not that an extra key is needed during data entry to include them: Control is no harder to use than Shift for an UPPER-case letter or other special printable character (e.g., !@#$). The problem is that traditionally those control characters are not directly visible. Even tab, carriage return and line feed, which produce immediate actions, do not produce visible output.

You can't tell the difference on a teletype between tabs and spaces. You can't tell the difference between a line feed and spaces run to the end of the line followed by a wrap to the next line. Similarly, the delimiters do not have a defined printable image. They may show in some (modern) text editors, and they may produce immediate actions in various devices, but they don't leave a mark.

All of this doesn't matter if data is only designed to be machine-readable - i.e., what we commonly refer to as binary files. But text for data entry and transfer between systems is often, deliberately, human-readable. If it is going to be human-readable, the delimiters need to be printable.
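A quick way to see the invisibility problem for yourself is to dump a stream that contains separators; a sketch using od(1), with made-up field values:

```shell
# The unit (037) and group (035) separators print as nothing on a
# terminal; od -c reveals them by their octal values.
printf 'alice\037bob\035carol' | od -c
```

Without a tool like od, the three names simply run together on screen, which is exactly why printable delimiters won out for human-readable data.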

4
  • 16
    This is the real reason. Even if people came up with a delimiter, most would still use what they are familiar with (commas, JSON, XML), because data is source code, not executable code. And that is in fact what has historically happened: delimiters DO EXIST, and have since the beginning, but people ignore them and invent human-readable syntax for data
    – slebetman
    Commented Apr 28, 2021 at 22:30
  • 6
    I've actually had to deal with files that use the 1C-1F characters. They are used in some standards. There's really no reason why editors can't show some symbol for these values, but they generally don't.
    – JimmyJames
    Commented Apr 29, 2021 at 16:48
  • 4
    This is kind of a Catch-22. In order for the delimiter to be part of a human readable format, it has to be visible. But if it's a visible character, someone is going to use it in their text, possibly because what you consider a single string is itself actually broken up into multiple parts (see Kaz's answer).
    – Joe Sewell
    Commented Apr 30, 2021 at 19:29
  • 1
    @JoeSewell Just re-visited my question again, and this is my favorite comment. I would mark it as an answer if it wasn't a comment :)
    – Dave
    Commented Apr 1, 2022 at 15:51
19

As was mentioned in another answer, ASCII does have delimiters. Looking here [1], these are mentioned:

code point name
U+001C File Separator
U+001D Group Separator
U+001E Record Separator
U+001F Unit Separator

and these are used. For example, U+001C (octal 34), is the default SUBSEP [2] string for GNU AWK.

  1. https://wikipedia.org/wiki/ASCII#Control_code_chart
  2. https://gnu.org/software/gawk/manual/html_node/Multidimensional
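As a small illustration of SUBSEP in action (a sketch; any POSIX awk should behave the same way):

```shell
# awk simulates multidimensional arrays by joining the subscripts with
# SUBSEP (default "\034", the ASCII file separator) into one string key.
awk 'BEGIN {
    a["x", "y"] = 1
    for (k in a) {
        n = split(k, parts, SUBSEP)
        print n, parts[1], parts[2]
    }
}'
# prints: 2 x y
```

Because SUBSEP is a control character, subscripts like "x" and "y" can safely contain commas or spaces without colliding with the joined key.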
3
  • 6
    There are also start of header, start of text and end of text. End of transmission would sort of count as well, I guess. Commented Apr 28, 2021 at 17:55
  • 5
    Referring to ASCII codes with Unicode code point notation (U+) is rather anachronistic. Commented Apr 30, 2021 at 11:05
  • 4
    I've always thought it's sad that these are not used more often. I'm sure every developer has spent too much time figuring out how to delimit things, and everyone invents their own way: +, |, ,, ;, etc.
    – pipe
    Commented Apr 30, 2021 at 12:02
10

This is mainly historical.

In the old days of informatics, data files were mostly fixed-width-field files, because that was the natural I/O for languages like Fortran IV and COBOL: n characters for the first field, m for the second, etc.

Then the C language provided a scanf function that split input on (sets of) whitespace, and people started to use free format for data files containing numbers. But that led to messy results when some fields could contain spaces (scanf is known as a poor man's parser). And since the other standard splitting function, strtok, used a single delimiter, most (English-speaking) people started to use the comma (,) as the separator, because it is easy to manually write a comma-separated-values file in a text editor.

Then National Language Support came into the game... In some European languages (French among them), the decimal mark is the comma. IT folks were used to the decimal point, but less technical users were not, so French versions of Windows started to define the semicolon (;) as the separator, to allow the comma in decimal numbers.

In the meanwhile, some realized that when fields were always of similar length, a tab character (which existed on all keyboards) provided nice vertical alignment, and that became the reason for a third standard.

Finally, standardization became a reality, and RFC 4180 emerged in 2005. It did define the comma as the official separator, but as Windows had decided to play the NLS game, tools and libraries wanting to process real files had to adapt to various possible delimiters.

And that is the reason why in 2021, we have many possible delimiters in CSV files...
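The divergence above is easy to reproduce; here is a hedged sketch (field names and values invented) contrasting the RFC 4180 comma with the semicolon convention used where the comma is the decimal mark:

```shell
# RFC 4180 style: comma delimits fields, period is the decimal mark.
printf 'pi,3.14159\n' | awk -F',' '{ print $1 " = " $2 }'
# prints: pi = 3.14159

# French-locale style: semicolon delimits fields, comma is the decimal mark.
printf 'pi;3,14159\n' | awk -F';' '{ print $1 " = " $2 }'
# prints: pi = 3,14159
```

Note that naively parsing the second file with -F',' would split the number itself in two, which is precisely why the semicolon convention exists.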

7

It has come to pass that there is a de facto universal delimiter in ASCII: the null character. Unix and the language C showed that you can build an entire platform in which the null character is banished from character strings, serving as a terminator in their representation. Other platforms have followed suit like Microsoft Windows.

Today, it's a virtually iron-clad guarantee that no textual datum contains a null byte. If a datum contains a null byte, it's binary and not text.

If you want to store a sequence of textual records or fields in a byte stream, if you separate them with nulls, you will have next to no issues. Nulls don't require any nonsense like escaping. If someone comes along and says they want to include a null byte in a text field, you can laugh them off as a comedian.

Examples of null separation in the wild:

  1. Microsoft allows items in the registry to be multi-strings: single items containing multiple strings. This is stored as a sequence of null-terminated strings catenated together, with an extra null byte to terminate the whole sequence. As in "the\0quick\0brown\0fox\0\0" to represent the list of strings "the", "quick", "brown", "fox".

  2. On the Linux kernel, the environment variables of each process are available via the /proc filesystem as /proc/<pid>/environ. This virtual file uses null separation, like PATH=/bin:/usr/bin\0TERM=xterm\0....

  3. Some GNU utilities have the option to produce null separated output, and that is precisely what allows them to be used to write much more robust scripts. GNU find has a -print0 predicate for printing paths with null termination instead of newline separation. These paths can be fed to xargs -0 which reads null-separated strings from its standard input and turns them into command line arguments for a specified command. This combo will cleanly pass absolutely all file names/paths regardless of what they contain: because paths cannot contain a null byte.
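A minimal sketch of the -print0/-0 convention, substituting printf for find so it runs anywhere without touching the filesystem:

```shell
# xargs -0 splits its input on NUL bytes only, so whitespace inside a
# "path" survives as a single argument to the command it runs.
printf 'a b\0c d\0' | xargs -0 -n 1 printf '[%s]\n'
# prints: [a b]
#         [c d]
```

With the default whitespace splitting, the same input would have been broken into four arguments instead of two.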

Why do we play games with other separation? Tabs, commas, semicolons and whatnot, rather than just using null? The problem is that we need multiple levels of separation. Okay, so nulls chop the byte stream into texts, reliably. But within those texts, there may be another level of delimitation needed. It sometimes happens that a single string has more structure inside it. A path contains slashes to separate components. A MAC address uses colons to separate bytes. That sort of thing. An e-mail address has multiple levels of nested delimitation like local@domain around the @ symbol, and then the domain part separated with dots. Parentheses are allowed in there and things like % and !. People write string-handling code to deal with these formats, and that string-handling code will not like null bytes in a lot of languages, due to the influence of C and Unix.

Demo of GNU Awk using the null byte as the field separator, processing /proc/self/environ.

$ awk -F'\0' \
      '{ for (i = 1; i <= NF; i++) 
           printf("field[%d] = %s\n", i, $i) }' \
      /proc/self/environ
field[1] = CLUTTER_IM_MODULE=xim
field[2] = XDG_MENU_PREFIX=gnome-
field[3] = LANG=en_CA.UTF-8
field[4] = DISPLAY=:0
field[5] = OLDPWD=/home/kaz/tftproot
field[6] = GNOME_SHELL_SESSION_MODE=ubuntu
field[7] = EDITOR=vim
[ snip ... ]
field[54] = PATH=/home/kaz/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/kaz/bin:/home/kaz/bin
field[55] = GJS_DEBUG_TOPICS=JS ERROR;JS LOG
field[56] = SESSION_MANAGER=local/sun-go:@/tmp/.ICE-unix/1986,unix/sun-go:/tmp/.ICE-unix/1986
field[57] = GTK_IM_MODULE=ibus
field[58] = _=/usr/bin/awk
field[59] = 

We get an extra blank field due to the null byte at the end, because Awk is treating it as a field separator, rather than terminator. However, this is possible precisely because GNU Awk allows for the null byte to be a constituent of character strings. The argument -F '\0' is not required to work, according to the POSIX specification. POSIX says, in a table entitled "Escape Sequences in awk" that

\ddd: A character followed by the longest sequence of one, two, or three octal-digit characters (01234567). If all of the digits are 0 (that is, representation of the NUL character), the behavior is undefined.

Thus it is entirely nonportable to rely on Awk to separate fields or records on the null byte. This kind of language problem is probably one reason we don't make more use of null characters.

2

To illuminate the history given by @SergeBallesta a bit more, back in the nascent days of the newly adopted ASCII (we are talking mainframe), the general purpose was to standardize input codes between systems so that everybody was on the same page. There was a lot of tussle between systems manufacturers to keep their products proprietary (essentially stuff being useable only on their systems) and this was detrimental to portability. This issue primarily related to taking a program, or input and output, from one system to another. For instance, one could write an output tape having data-input files, some output files, and some FORTRAN program files that were used on one system, take that tape to another system made by a different manufacturer, and find the tape was not readable. The big guys in the room, IBM, had a good standardized platform, EBCDIC, which was adaptable as ASCII with only a minor change in the binary coding of the EBCDIC character set. Everybody got on board with that. Up to then, the only standardized character set was on a typewriter!

However, back at the ranch, programming was largely a matter of simply reading input data in a programmer-determined format, manipulating those data in a program, and producing output that was also in a programmer-determined format. There was no need for a delimiter. One of the most intensive uses of formatted input and output was with the FORTRAN programming language. For instance, data would be keypunched on 80-column Hollerith cards in a specific, organized input format determined by the person who programmed that input segment of the program. Everything was formatted, standardized, designed by the programmer/user. There was no such thing as comma separation of input data. Output was printed on 132-column broadsheet, edge-perforated paper. Output was also punched on 80-column cards. Programmers were expected to produce an organized, tabulated output that was easily read. Standardized format input and output did not require a delimiter to separate input data. Everything was tabulated nicely: input data laid out in fixed columns, printed output in tabulated columns with headings, everything nice and tidy.

One must remember that in FORTRAN, all aspects of printer carriage control were possible for the programmer. In fact, the programmer was totally responsible for understanding how to operate a 132 column printer from within a FORTRAN program, as well as how to represent output in computer memory for efficient output to a printer, or to tape, where said file could be live-printed, reviewed on a terminal, or later printed.

With the advent of personal desktop computing, all of this changed, because data input and output became totally electronic. Yes, there were still formatted input and output files, but the programming environment became more interactive with the user's needs. A file with formatted input as images of 80-column cards, or pages of 132-column output, could be read card-by-card or line-by-line by an input pre-processor (a subroutine), converted to comma-delimited form by looking for trailing blank spaces on required input, and rewritten to a temporary file in memory. FORTRAN-compatible formatted hard-copy input was no longer necessary. This was all very easy with a standardized character set, and ASCII made this standardization possible. In fact, the key to doing this, the comma, was already in ASCII. The temporary file could then be re-read by software that specifically used input from comma-delimited data files. Now everything changed! Only a few short years were required to relegate the mainframe to history, and advance the state of the art to a whole new plane.
