60

In C, '' is used to denote a character, while "" is used to denote a string. Why was this syntax chosen?

I tried to research this using Wikipedia’s Timeline of Programming Languages along with Rosetta Code’s reference page for strings. It seems that C was the first widespread programming language to implement this distinction, since in popular languages before it, such as Pascal, ALGOL, COBOL and FORTRAN, '' and "" were either interchangeable or only one of them was used.

I know that it might seem like an obvious choice to use '' for characters and "" for strings, but it actually isn’t. Before programming, these symbols were used only as punctuation, and there is no rule or convention there that '' should be used when quoting smaller things.

Since I found Why was `!` chosen for negation? and Why was "C:" chosen for the first hard drive partition? on this SE site, I figured that this is the right place to ask this.

20
  • 8
    Welcome to Retrocomputing! Yes, this is the right site for this question. (Indeed, I'm surprised it hasn't already been asked.)
    – DrSheldon
    Commented Jun 25, 2021 at 16:21
  • 4
    @jamesqf Not sure what you mean. Those uses ARE punctuation.
    – barbecue
    Commented Jun 26, 2021 at 4:15
  • 5
    @jamesqf where on earth did you get that idea?
    – barbecue
    Commented Jun 26, 2021 at 16:33
  • 8
    @jamesqf I've also been reading and writing English for the past several decades, and I have never seen such a restrictive definition of punctuation marks for modern English. The purpose of punctuation hasn't been just to identify pauses for centuries. The elocutionary definition you're using fell out of favor in the 17th century, when the syntactic school became prominent. Punctuation identifies not just pauses, but ways to clarify syntax. Since you are rejecting any sources from the web, I won't bother to provide links to Britannica, the OED, or other unreliable online sources.
    – barbecue
    Commented Jun 27, 2021 at 20:10
  • 4
    @jamesdlin, in American English, not 'typical' English. In British English it's (traditionally) the opposite: single quotes as the main, double quotes as the inner ones. From here, a curious observation: C (with its double quotes for strings) was invented by the Americans, while Pascal (with single quotes) by a European (not a Brit, but Europeans tend[ed] to learn British English).
    – Zeus
    Commented Jun 28, 2021 at 1:11

4 Answers

57

For type system reasons, and for compatibility with B.

B is a programming language that served as the immediate ancestor of C. The salient thing about B is that it had no type system: all values in B were machine words (corresponding to the C type int). In B, there were two ways to represent strings in source code: string literals⁰, which evaluated to a pointer to a block of memory holding the string, and multi-character literals, which packed multiple character codes directly into a single machine word. The latter were famously used in Kernighan’s original “Hello, world” program:

main( ) {
 extrn a, b, c;
 putchar(a); putchar(b); putchar(c); putchar('!*n');
}

a 'hell';
b 'o, w';
c 'orld';

Since the two kinds of values behaved so differently, yet could not be distinguished at the type system level (because there was none), they had to use different syntaxes.

As C is an evolution of B, it simply inherited all this baggage and could not change it without breaking compatibility. Although some breaking syntax changes were made in C, there apparently wasn’t a compelling enough reason to make one here; the weak typing of C does, after all, maintain a certain continuity with B.
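
To see the inherited distinction in modern C terms, here is a minimal sketch (my illustration; the exact numbers are implementation-specific, and ASCII with a 4-byte int is assumed):

#include <stdio.h>

int main(void) {
    int  c = 'A';   /* character constant: an int holding the character code (65 in ASCII) */
    char s[] = "A"; /* string literal: an array of two chars, {'A', '\0'} */
    int  m = 'ab';  /* multi-character constant: still legal, value implementation-defined,
                       a direct leftover of B's packed character words (compilers warn) */

    printf("c = %d, s = \"%s\", sizeof 'A' = %zu, sizeof \"A\" = %zu, 'ab' = %#x\n",
           c, s, sizeof 'A', sizeof "A", (unsigned)m);
    return 0;
}

On a typical such implementation this prints c = 65, s = "A", sizeof 'A' = 4, sizeof "A" = 2 and a packed value like 0x6162 for 'ab': the character constant and the string literal really are different kinds of objects in C as well.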

The above, though, raises the question of why such a distinction was made in B in the first place. Since B was conceived as a simplified version of BCPL, one might expect to find some answers in materials about that language. But according to the manual, BCPL differentiated character literals from string literals not by their delimiters, but by their length:

A string constant of length one has an Rvalue which is the bit pattern representation of the character; this is right justified and filled with zeros.

A string constant with length other than one is represented as a BCPL vector [i.e. array]; the length and the string characters are packed in successive words of the vector.

So the delimiter distinction between character literals and string literals was first made in B. As to why, and why the syntax was chosen the way it was, we probably have to rely on speculation, as neither Users' Reference to B, nor A Tutorial Introduction to the Language B, nor The Development of the C Language elaborates on that particular topic. My hypothesis would be:

  • Because B allowed multiple characters in its character literals, it could no longer distinguish characters from strings by the length of the literal (and, again, B had no type system to convert between them transparently), so a syntactic distinction was necessary.
  • Character literals, as conceptually more lightweight (not requiring additional storage), were assigned the glyph that was (visually) simpler and took fewer keystrokes to type. (I shamelessly stole this one from @Toby Speight.) This may have been influenced by DEC assemblers, as suggested in @dave’s answer.

This explanation is mostly conjecture, but it seems we may have a hard time finding a better one.


⁰ Contemporaneous documentation used the term “constant” instead of “literal”, since that was the only kind available back then anyway. In fact, the C standard uses the term “constant” for literals to this day, and only very recently has a paper been put forward to change the terminology.

18
  • 1
    BCPL seems to have only had strings, not character literals. Interestingly, sample BCPL code in Wikipedia uses " for string literals, while the BCPL manual at <bell-labs.com/usr/dmr/www/bcpl.html> uses ' in its description of the syntax. Commented Jun 25, 2021 at 18:59
  • 1
    @hb20007 I dug into the BCPL manual more closely, and it says BCPL made no syntactic distinction between character literals and string literals: ‘A string constant of length one has an Rvalue which is the bit pattern representation of the character; […] A string constant with length other than one is represented as a BCPL vector [i.e. array]’. BCPL had types, so maybe contextual disambiguation was tenable there, but certainly not in the untyped B. Commented Jun 25, 2021 at 19:17
  • 3
    On your footnote, the term "literal" is not anachronistic. There is a distinction between a literal and a constant. In the expression const int a = 'b'; (yes, that is legal in C) a is an int constant with a value of 0x62 and 'b' is a char literal.
    – JeremyP
    Commented Jun 26, 2021 at 9:43
  • 1
    @user3840170 It's always been legal in C. Anyway, that's not the point. The point is that a literal and a constant are distinct concepts.
    – JeremyP
    Commented Jun 26, 2021 at 10:02
  • 3
    C did not, in its early life, have 'constants' that were not 'literal', so you can excuse that fairly impoverished language if its compilers sometimes confused the two. But other languages made the obvious distinction between a thing that literally denoted itself, and a thing that unchangingly denoted some value that was not the same as the marks from which it was made. (Note that other languages had 'denotations' rather than 'literals', but the concept is the same).
    – dave
    Commented Jun 26, 2021 at 11:09
25
+50

Not quite the same thing, but PDP-11 assemblers used 'X as a single-character value (i.e., a byte), and "XY as a two-character value (i.e., a word).

MOV  #"EH, BUFF   ; two characters packed into one word
MOVB #'?, BUFF+2  ; a single character in one byte

The single/double quote corresponds nicely to the number of characters involved.

Ritchie et al. would surely have been aware of this DEC convention. In fact, the same convention was carried over into the Unix assembler 'as', which according to its man page was derived from the DEC assembler PAL-11R.

References: (both from 1971)

DEC usage: section 4.3 in this PAL-11R programmer's manual

Unix usage: section as(I) in the UNIX programmer's manual

FWIW, in DEC syntax, strings were rather different: a specific pseudo-op was used to declare strings, with arbitrary delimiter pairs, though slashes were conventional:

.ASCII /EH?/
.ASCII ZEH?Z
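
For comparison, a rough C rendering of these conventions (a sketch only; the byte order in which "EH packs into the word is an assumption about the PDP-11's little-endian layout):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t  byte_val = '?';               /* like  MOVB #'?, BUFF+2 : one character, one byte  */
    uint16_t word_val = 'E' | ('H' << 8);  /* like  MOV  #"EH, BUFF  : two characters, one word */
    char     msg[]    = "EH?";             /* like  .ASCII /EH?/     : a string of bytes        */
    printf("%#x %#x %s\n", (unsigned)byte_val, (unsigned)word_val, msg);  /* e.g. 0x3f 0x4845 EH? */
    return 0;
}
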
6

Consider how other languages handled characters. Many languages represented characters as strings of length 1, rather than as a type of their own. This was inefficient in several ways:

  • The source code was more verbose. Compare

    IF MID$(Q$,1,1) = "A" THEN
    

    to

    if (q[0] == 'A')
    
  • Operations such as extracting one character or performing a comparison are more efficient with character types than with strings. The MID$ operation above allocates and copies yet another string. The = operation requires scanning through the two strings involved.

  • For compiled languages, character literals take up less space than string literals, both on disk and as a loaded program. String literals need to allocate space for the characters, the length or terminating character, and the address which references the string. Character literals can simply be an immediate operand of the instruction you were going to compile anyway.

  • There were also constructs like CHR$(65), whereas 'A' is both more efficient and easier to understand.

As the intent of C is to create highly efficient programs, it was necessary to have a character type separate from string types. In turn, this meant having a way to represent character literals separately from string literals.
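
A small sketch of that efficiency difference (illustrative, not taken from any particular compiler's output): comparing a character is a single integer comparison, while emulating a "string of length 1" forces a copy plus a scan, roughly what the MID$/= combination above has to do.

#include <stdio.h>
#include <string.h>

/* One integer comparison. */
int starts_with_a_char(const char *q)
{
    return q[0] == 'A';
}

/* Copy the first character into a temporary string, then scan both
   strings, mimicking the MID$ allocation and the = comparison.     */
int starts_with_a_string(const char *q)
{
    char first[2] = { q[0], '\0' };
    return strcmp(first, "A") == 0;
}

int main(void)
{
    printf("%d %d\n", starts_with_a_char("ABC"), starts_with_a_string("ABC"));
    return 0;
}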

Modern compilers probably could determine the data type from the surrounding context, but early compilers simply weren't that sophisticated.

3
  • ISTR that Prime Fortran IV (which for a while was a popular system programming language on Primes) extended Fortran with strings, which were packed into arrays of INTEGER (or anything else, I think).
    – Rich
    Commented Jun 28, 2021 at 4:16
  • I do not see how the first bullet point is in any way necessary. Fortran uses ' and " interchangeably, characters are strings of length one, and still if (q(1:1) == 'A') works perfectly fine. Similarly for the second point. And the third point. You can have strings that come with their length information. Then you do not need any terminating characters. Terminating characters in C are a well-known source of buffer overflows and severe security failures. Commented Jun 28, 2021 at 17:01
  • 1
    @Rich Those are Hollerith constants; they were standard Fortran 66. If you take the example from the top answer: A=4Hhell B=4Ho, w C=4Horld. Commented Jun 28, 2021 at 17:07
3

In C, '' is used to denote a character, while "" is used to denote a string. Why was this syntax chosen?

The syntax difference is to describe two different constructs:

  • A character is a single value, used directly by value, while
  • A string is an array of values, usually referred to through a pointer.

Most importantly, without that distinction it would be impossible for a compiler to decide whether "A" refers to the character (value) A or defines a string containing the single character A (with a delimiting length or terminator).

Important for compiler construction: having that distinction made up front, with the first symbol of the token, simplifies the parser, much like writing 0x in front of a hex number saves the effort of figuring out whether it's a number or something else. The parser does not have to read the whole token to see what it is about, but can proceed according to what the leading symbol says.

Because of this, two different quote characters are needed.
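
A minimal scanner sketch (illustrative only, with a made-up classify() helper, not code from any real compiler) shows how the leading symbol alone selects the rule:

#include <ctype.h>
#include <stdio.h>

enum token_kind { TOK_CHAR_CONST, TOK_STRING, TOK_NUMBER, TOK_OTHER };

/* The first character of the token already tells the scanner which rule
   to apply, before anything else has been read.                         */
enum token_kind classify(const char *src)
{
    switch (src[0]) {
    case '\'': return TOK_CHAR_CONST;                  /* 'x'   */
    case '"':  return TOK_STRING;                      /* "xyz" */
    default:   return isdigit((unsigned char)src[0])
                    ? TOK_NUMBER : TOK_OTHER;          /* 42, foo, ... */
    }
}

int main(void)
{
    printf("%d %d %d\n", classify("'a'"), classify("\"abc\""), classify("42"));
    return 0;
}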

Now, why exactly these two were selected is hard to say, but it seems intuitive that the single mark is used for the shorter item, while the double mark covers the longer one. This is also somewhat consistent with usage in English-language writing, where speech and other quotations are primarily set between double marks. That makes a lot of sense, as regular English writing already contains lots of single marks used for contractions and abbreviations.

It seems that C was the first widespread programming language which implemented this

C inherited it from B, which introduced this differentiation as part of its simplification from BCPL. B was written by Ken Thompson and Dennis Ritchie, who later went on to create C.

14
  • 13
    Double quote marks for strings are consistent with usage in American English orthography, but not with British English. (But of course C was invented by Americans).
    – alephzero
    Commented Jun 25, 2021 at 14:32
  • 8
    I like the consistency that has the narrower character ' for the shorter item. It could also be (I'm speculating here) that character literals were implemented before string literals, and on many keyboards ' is unshifted and " requires shift, so it's possible the simpler character was used first. Commented Jun 25, 2021 at 15:18
  • 2
    It's also interesting that there are two Unix-y conventions around this: one is paired-apostrophes as in C, which works best IMO for 'straight' apostrophes, and the other (often used in man pages?) pairs apostrophe with accent-grave, which looks horrible on any font I've ever used.
    – dave
    Commented Jun 25, 2021 at 17:16
  • 2
    @another-dave See this page for more about that quoting convention. It's also used in the m4 macro processor and in TeX.
    – texdr.aft
    Commented Jun 25, 2021 at 17:54
  • 4
    In British English (or English, as the British call it), a quotation is enclosed in quotation marks, unless it’s nested. Single quotes are typically used when referring to something that might not be a valid linguistic construct otherwise, and single characters would certainly fit into this. So the choice to punctuate this way is entirely logical although I have no idea whether that was a driving force for the C language.
    – Frog
    Commented Jun 25, 2021 at 21:25
