What are source and execution character sets?

Question

I was looking at the changes in C23, and found this in Annex M of the C23 draft:

added @ (U+0040, COMMERCIAL AT), $ (U+0024, DOLLAR SIGN), and ` (U+0060, GRAVE ACCENT, "Backtick") into the source and execution character set;

What is the difference between the "source character set" and the "execution character set"? Are $, @, and ` allowed in identifiers in C23?

I found a question for C++: stackoverflow.com/q/3768363/20017547, but not C. — Harith, Commented May 27 at 8:43
They are both described in the relevant option setting at Compiler options listed alphabetically. — OldBoy, Commented May 27 at 8:48

Lundin · Accepted Answer · 2024-05-27 10:18:44Z

There is a straightforward explanation in C23 (and older) 5.2.1, where the terms are formally defined:

Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters.

In practice, the basic character set is a subset of ASCII: the Latin letters, the decimal digits, common punctuation, and a few control characters.

In C23, subclause 5.2.1.1 allows both the source and the execution character set to contain multibyte characters, and it requires the characters you mention to be present:

The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set. For both character sets, the following shall hold:

— The basic character set, @ (U+0040 Commercial At), $ (U+0024 Dollar Sign), and ‘ [SIC] (U+0060 Grave Accent, "Backtick") shall be present and each character shall be encoded as a single byte.
...

My take is that 5.2.1.1 is optional/implementation-defined, given the "may".
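The single-byte requirement quoted above can be observed from inside a program. The following is only a minimal sketch: the helper names added_chars_byte_count and added_chars_array_size are made up for illustration, and the result describes whatever execution character set your compiler targets.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes the execution character set uses
   to store the three characters named in C23 5.2.1.1. If each one is a
   single byte, as the quoted text requires, the result is 3. */
size_t added_chars_byte_count(void)
{
    static const char added[] = "@$`";
    return strlen(added);
}

/* Hypothetical helper: size of the same string literal as an array,
   which also counts the terminating null character. */
size_t added_chars_array_size(void)
{
    return sizeof "@$`";
}
```

With a compiler whose execution character set is ASCII or UTF-8, the first function returns 3 and the second returns 4.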

Kaz · Accepted Answer · 2024-05-28 04:47:01Z

Programs can be cross-compiled. The machine and environment where the program is compiled can have a different character set from the environment where the compiled program executes.

In principle, we could be working with a C program whose source code is written in ASCII, on an ASCII operating system like Unix, but be cross-compiling it to a machine which uses a different representation of text, like EBCDIC.

So for instance in the source code, the character constant 'A' is represented by the ASCII character 65, but the cross-compiler must map that to the EBCDIC value of A which is 193.

The specification for a language that can support cross-compiling needs to be carefully worded. For instance, the requirement cannot be stated as "the value of a character constant consisting of a single unescaped character between single quotes is the code point value of that character". Instead, it is specified in terms of a mapping between the translation (source) and execution character sets.
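To make the mapping concrete, here is a small sketch that reports which value the compiler chose for 'A' in the execution character set. The function names upper_a_value and classify_execution_charset are hypothetical, and the classification is deliberately crude: it only distinguishes the two values mentioned above.

```c
/* Hypothetical helper: the value of 'A' in the execution character set.
   The compiler performs the source-to-execution mapping when it
   translates the character constant. */
int upper_a_value(void)
{
    return 'A';
}

/* Hypothetical helper: a rough guess at the execution character set,
   based only on the value of 'A'. */
const char *classify_execution_charset(void)
{
    if ('A' == 65)
        return "ASCII-compatible";
    if ('A' == 193)
        return "EBCDIC";
    return "other";
}
```

The test is done in ordinary code rather than in an #if directive, because the values of character constants seen by the preprocessor are not guaranteed to match those used in the compiled program. On a typical ASCII or UTF-8 target, upper_a_value() returns 65.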

As for $, @ and `, the simple fact is that they are not used in the language: they don't serve as punctuators and are not parts of identifiers. These characters may occur in comments, if they are part of the translation character set, and in character constants and string literals if they have a mapping to the execution character set.

The C language could be implemented in a hypothetical environment which lacks these characters. Therefore, a strictly conforming program cannot use them for any purpose, not even in a comment. When they are used in a program, and the program is accepted, it is an extension. In other words, when you use the full ASCII character set in your C programs, you're relying on a language extension (a very common one).

C implementations can also use these characters for extensions. For instance, some compilers allow $ in identifiers, and that may be necessary in order to connect with some external names (like in assembly language). The Objective-C dialect uses @ as a prefix for its extensions.

Xiangzhi Liu · Accepted Answer · 2024-05-28 04:20:53Z

The source and execution character sets are the sets of characters that must be "recognized" by the compilation and execution environments, respectively. For example, characters in the basic source character set must be interpreted correctly by any conforming compiler.

$, @ and ` cannot be used in identifiers because they have neither the XID_Start nor the XID_Continue property. Adding them to the basic source character set only means that the compiler knows that 0x24, 0x40 and 0x60 are $, @ and `, respectively (assuming ASCII encoding). For example, the following code causes a compilation error (violation of a syntax constraint) in C23, but had undefined behavior (Annex J.2 (6)) before C23.

int main()
{
    @
}

A character not in the basic source character set is encountered in a source file, except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token (5.2.1).
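In contrast to the stray @ in the snippet above, the same characters are fine inside character constants and string literals. A short sketch, in which the helper name count_added_chars is invented for illustration:

```c
#include <stddef.h>

/* Hypothetical helper: count how many characters in s are one of
   @, $ or `. The three characters appear only inside character
   constants here, never as tokens of their own. */
size_t count_added_chars(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (*s == '@' || *s == '$' || *s == '`')
            ++n;
    }
    return n;
}
```

For example, count_added_chars("user@example.com costs $5 `now`") counts one @, one $ and two backticks.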
