21
$\begingroup$

Inspired by Why do keywords have to be reserved words?

Suppose that you're the BDFL of a programming language. Version 1 of the language becomes decently popular. A few years later, you decide to make Version 2 with a bunch of new features. This requires new keywords to be added to the language. For the sake of making the example concrete, let's suppose that the new keywords are async and await.

Problem is, there are programs written in Version 1 of the language that happen to use async or await as the name of a variable or function. And now that these are reserved words, the existing code no longer compiles.

What approaches do existing programming languages take to minimize the hassle of renaming identifiers when they conflict with newly-added keywords? How would you choose to handle this situation?

$\endgroup$
3
  • 8
    $\begingroup$ Another option is not adding new keywords and 'reusing' keywords instead. Hence so many different meanings of static in C. $\endgroup$
    – CPlus
    Commented Jul 12, 2023 at 2:26
  • 2
    $\begingroup$ My first thought was that the language should require variables to have some special signifier - for example $variable. The $ sign would not be present in keywords. That way, any new keyword introduced could not possibly be conflated with an existing variable. $\endgroup$ Commented Jul 13, 2023 at 9:19
  • $\begingroup$ The question assumes that keywords are necessarily reserved words. This assumption isn't always true. E.g FORTRAN has many keywords, but no reserved words. $\endgroup$ Commented Jul 27, 2023 at 1:15

14 Answers

25
$\begingroup$

Require explicit opt-in to language version

Taking inspiration from C#, one can require that programs declare which version of the language they are using. The compiler/interpreter can then enable newer features selectively, depending on the declared version.

An example is the required keyword: in programs targeting C# 10 or earlier, the compiler warns when it is used as an identifier, while from C# 11 onward the compiler refuses to accept it as an identifier.

This allows older programs to compile and run fine on a newer compiler/interpreter, while newer programs may opt in.

If the version can be declared at the file or scope level, then new keywords can be used in an older program without a total rewrite.
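As a rough illustration of the idea (not how the C# compiler actually does it), a lexer can pick its keyword table from the declared version. The table `KEYWORDS_BY_VERSION`, the version numbers and the function `classify` below are made up for this sketch.

```python
# Hypothetical keyword sets for an imaginary language. The version numbers
# and the async/await additions are assumptions made for illustration.
KEYWORDS_BY_VERSION = {
    1: frozenset({"if", "else", "while", "return", "var"}),
    2: frozenset({"if", "else", "while", "return", "var", "async", "await"}),
}


def classify(word: str, declared_version: int) -> str:
    """Return 'keyword' or 'identifier' for a word under the declared version."""
    if declared_version not in KEYWORDS_BY_VERSION:
        raise ValueError(f"unsupported language version: {declared_version}")
    keywords = KEYWORDS_BY_VERSION[declared_version]
    return "keyword" if word in keywords else "identifier"


# The same word is classified differently depending on the declared version.
old_meaning = classify("await", 1)
new_meaning = classify("await", 2)
```

A file that declares version 1 keeps treating await as an ordinary name, while a file that declares version 2 gets the keyword.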

$\endgroup$
12
  • 4
    $\begingroup$ I don't know why more languages don't do this, since it simultaneously accommodates an unlimited range of language extensions while providing 100% compatibility with programs that specify what language version they require. $\endgroup$
    – supercat
    Commented Jul 12, 2023 at 20:24
  • 11
    $\begingroup$ @supercat Because at some point you just have several different languages instead. $\endgroup$
    – Passer By
    Commented Jul 13, 2023 at 6:50
  • 4
    $\begingroup$ @supercat this is also how ultra-legacy systems are created because companies will simply put zero resources into upgrading, since "it works". I've worked on MSSQL database stuck on some archaic 2003 compatibility level simply because the development company couldn't be arsed to do proper testing and actually use newer versions. This level is missing some critical stuff such as the string_agg function, and it's just terrible. $\endgroup$
    – Nelson
    Commented Jul 14, 2023 at 0:49
  • 1
    $\begingroup$ @PasserBy: The C Standard has forever been blocked from providing features that would be useful for many applications because a few otherwise-useful implementations would be unable to support them. Recognizing the existence of constructs that implementations should support when practical would allow the Standard to recognize features that are common to implementations designed to be suitable for low-level programming on commonplace platforms. $\endgroup$
    – supercat
    Commented Jul 14, 2023 at 14:47
  • 1
    $\begingroup$ @PasserBy: Further, it is useful simultaneously for a language to have the notion "The type unsigned char is 8 bits on all conventional platforms", but also to say "The TI 32050 compiler processes a language that's just like 'ordinary C', except that character types are 16 bits", or "The Atari 400 C compiler behaves as specified by the C Standard, except that [list of features] are omitted to allow operation with 16K of RAM, and programs requiring those features will be rejected". In general, a compiler that supports full-precision double may be more useful than... $\endgroup$
    – supercat
    Commented Jul 14, 2023 at 16:24
17
$\begingroup$

Let's suppose that the new keywords are async and await. What approaches do existing programming languages take to minimize the hassle of renaming identifiers when they conflict with newly-added keywords? How would you choose to handle this situation?

I did have to handle that situation, so I can tell you exactly what we did.

In C# 5, await means the operator only in methods that are declared with the async modifier. In C# 4 and before it was a syntax error to use async as a modifier, so there was no back-compat issue; it is not a breaking change to turn a program that never compiled into one which does. Moreover, async as a modifier is legal only in a declaration; every other usage can be treated as an identifier, so again, no back-compat issues.

That is, async has no semantic purpose of its own in C# 5. It is there only to signal to the compiler that await means the operator in what follows.

More generally, we used several different tricks throughout the evolution of C# to add new keywords. First off, it's important to realize in C# that there are "reserved" and "contextual" keywords. C# has only added contextual keywords since C# 2. (And C# 1 started with contextual keywords; get and set are contextual for example.) Also, C# allows any identifier to be preceded by @ to indicate that it is to be treated as an identifier, not a keyword, even if it is reserved.

Every time a new keyword was added, the grammar and/or semantic analysis pass was carefully designed so that we could determine whether the keyword was being used in its new sense or was an identifier. For example, when yield was added to C# 2, the original design was just yield whatever; as a statement, but that is grammatically ambiguous with declaring a local whatever of type yield, so the design was modified to require yield return whatever;, which was always a syntax error in C# 1.

You've probably noticed that this choice for C# 2 was different from the choice for C# 5; we could have in C# 5 chosen a two-word phrase for the operator instead of requiring a marker modifier. It could have been async wait whatever; and that would have been a syntax error in previous versions of the language. We considered many options like that before arriving at the marker modifier solution; consistency with yield return was attractive but ultimately was considered too wordy. (Whether the C# 2 design team considered a marker modifier for iterator blocks, I do not know offhand.)

When designing LINQ for C# 3 there were a great many contextual keywords added and it was not clear what query syntax we'd arrive at. To facilitate finding problems early one of our developers with expertise in parser theory wrote a grammar analyzer which could produce examples of expressions that parsed ambiguously for a given query grammar proposal. That certainly helped us find lurking problems early in the design process.

In C# 4, dynamic was easy. There's no special grammar logic needed, and the semantic analyzer just checks to see if dynamic already has a meaning in the program and if it does, it uses that, otherwise it falls back to the special type. Same for var in C# 3.
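The "use the existing meaning if there is one, otherwise fall back" rule can be sketched in a few lines. This is only a toy model of the lookup order, not the real C# semantic analyzer; `FALLBACK_TYPE_NAMES` and `resolve_type_name` are names invented for the sketch.

```python
# Names that get a special built-in meaning only when the program has not
# already given them one (illustrative set, not the compiler's real table).
FALLBACK_TYPE_NAMES = frozenset({"dynamic", "var"})


def resolve_type_name(name, user_declared_types):
    """Resolve a type name, preferring the program's own declarations."""
    if name in user_declared_types:
        # An existing program that declared a type called 'dynamic' keeps working.
        return ("user-type", name)
    if name in FALLBACK_TYPE_NAMES:
        # No existing meaning, so fall back to the special built-in one.
        return ("special", name)
    raise NameError(f"unknown type name: {name}")


legacy = resolve_type_name("dynamic", {"dynamic", "Customer"})
modern = resolve_type_name("dynamic", {"Customer"})
```

Because the user's declaration always wins, no program that compiled before can change meaning.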

I hope that helps; let me know if you want more details on any of these techniques.

$\endgroup$
4
  • 1
    $\begingroup$ Wow, one of the actual designers of a programming language I use answered my question. I feel honored ☺️ $\endgroup$
    – dan04
    Commented Jul 15, 2023 at 17:57
  • 1
    $\begingroup$ "yield return" is easier to understand than "yield" for someone who has not seen it before. $\endgroup$ Commented Jul 16, 2023 at 18:21
  • 3
    $\begingroup$ @dan04: You're very welcome but I assure you also, I'm just some guy. $\endgroup$ Commented Jul 17, 2023 at 19:43
  • 2
    $\begingroup$ The point about yield return in C# is really interesting; I have sometimes wondered why C# uses this instead of the simpler yield, but now of course I completely understand the "backwards-compatibility" reason behind this! $\endgroup$
    – printf
    Commented Jul 22, 2023 at 20:54
16
$\begingroup$

Depending on the syntax specifics, a new keyword can be contextual at the expense of a new ambiguity the parser must handle.

For example, let's say you add a prefix unary expression operator await to your language, and there's a program with an await identifier:

var await = 4;

This need not be an error, since it could never be a legal use of the await operator anyway. In fact, in most positions where an identifier is declared, a keyword is not going to be ambiguous. Later, in other expressions, when the parser encounters a sequence like

var x = await

it needs one (or possibly more) tokens of lookahead to see what comes after it. For example, you might have

var x = await * 2;

The * token is enough to disambiguate.

Problems may arise if a certain token forms both unary and binary expressions, for example

var x = await + await;

This might be await (+await), or (await) + (await), depending on your rules. In this situation you might choose to simply require parens to disambiguate, choose the new or old behavior consistently, or defer to other strategies as described in other answers.
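A minimal sketch of that one-token lookahead, assuming a pre-split token list and a toy operator set (`BINARY_ONLY`, `UNARY_OR_BINARY` and `classify_await` are all invented here; a real parser would work on its own token types):

```python
# Tokens that can only follow a complete operand, so an 'await' right
# before them must be an identifier.
BINARY_ONLY = frozenset({"*", "/", "=", ";", ")", ","})
# Tokens that could be a binary operator or the start of a unary operand.
UNARY_OR_BINARY = frozenset({"+", "-"})


def classify_await(tokens, i):
    """Classify tokens[i] == 'await' as 'identifier', 'operator' or 'ambiguous'."""
    if tokens[i] != "await":
        raise ValueError("token at index i is not 'await'")
    # Treat end of input like a statement terminator.
    following = tokens[i + 1] if i + 1 < len(tokens) else ";"
    if following in BINARY_ONLY:
        return "identifier"
    if following in UNARY_OR_BINARY:
        return "ambiguous"  # needs parentheses or a fixed language rule
    return "operator"  # the next token starts an operand


times = classify_await("var x = await * 2 ;".split(), 3)
plus = classify_await("var x = await + await ;".split(), 3)
call = classify_await("var x = await fetch ( ) ;".split(), 3)
```

The "ambiguous" result is where the language designer has to pick one of the strategies mentioned above.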

$\endgroup$
1
  • 7
    $\begingroup$ This is why the Swift 5.9 keywords discard, borrow, and consume must be applied directly to an identifier rather than an expression in parentheses — discard(self) is just a regular function call in 5.8, while discard self would be illegal. $\endgroup$
    – Bbrk24
    Commented Jul 12, 2023 at 11:39
14
$\begingroup$

Provide an upgrade tool

Another approach is to not preserve backwards compatibility and instead provide tools to upgrade to the new version.

In the case of adding a couple of new keywords, all the upgrade tool would have to do is rename any variables whose names are now reserved words.

Make it possible to identify the language dialect in use

This procedure is facilitated if you have something in the source code that can identify the version in use. Otherwise your tool may incorrectly treat the new keywords as variables.

Example

For example running foo1to2 on:

dialect(foolang,1)

var IamAKeyword : integer = 2;

It becomes:

dialect(foolang,2)

var IamNotAKeyword : integer = 2;

Upgrade tools can be a pain to write if the nature of the change is complex or the syntax is complex. You can make your life easier if you consider this issue upfront.
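For a sense of scale, here is a small sketch of such a renamer. It borrows Python's standard `tokenize` module as a stand-in lexer, so it only handles Python-like source; the new keyword `defer`, the `_` suffix and the function `rename_conflicts` are assumptions made for the example.

```python
import io
import tokenize

# Hypothetical words that become keywords in "version 2" of the language.
NEW_KEYWORDS = frozenset({"defer", "spawn"})


def rename_conflicts(source, new_keywords=NEW_KEYWORDS, suffix="_"):
    """Rename identifiers that clash with new keywords by appending a suffix."""
    lines = source.splitlines(keepends=True)
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    # Go right to left so that column offsets of earlier tokens stay valid.
    for tok in reversed(tokens):
        if tok.type == tokenize.NAME and tok.string in new_keywords:
            row, col = tok.start
            line = lines[row - 1]
            end = col + len(tok.string)
            lines[row - 1] = line[:col] + tok.string + suffix + line[end:]
    return "".join(lines)


old_source = "defer = 4\nprint(defer + 1)  # defer is mentioned here\ns = 'defer'\n"
new_source = rename_conflicts(old_source)
```

Working on tokens rather than raw text means strings and comments are left alone. A real tool would also have to check that the new name does not collide with an existing one.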

Consider context

As others have said you can somewhat avoid the issue if your new syntax is illegal in the previous language version/dialect.

Beware though that you are trading convenience of upgrade for clarity in the language.

An interesting example of this is C++. Relatively early on, the = 0 syntax was introduced for pure virtual methods rather than a pure keyword:

class Foo
{
   virtual void someMethod() = 0;
};

Assigning to a method was not legal syntax before, so this could not conflict with existing code. Much later, when override and final were introduced, they were placed in the same position. So instead of:

class Foo
{
   override void someMethod();
};

where override could be mistaken for the name of the return type, say, you write:

class Foo
{
   void someMethod() override;
};

There was a good link about this which I cannot find but https://stackoverflow.com/questions/32757571/why-do-the-c-language-designers-keep-re-using-keywords is very relevant.

Don't bother - just document

A final option is to not do anything other than carefully document the changes.

You leave the community to deal with the issue. They may well come up with an upgrade tool on your behalf.

This risks alienating your user base, particularly if you do it too often. But if your language is good and your community loyal you can get away with it. This is really trading user friendliness for development resources.

There are quite a few languages that are guilty of this (despite having ample development budget) and remain popular.

Deprecate

When evolving a language you could consider formally deprecating interfaces in one version before removing or changing them in a subsequent version. This slows down changes to your language but makes it easier for users to manage the transition.

This is particularly useful in 'enterprise' environments.

You can also consider having a syntactic or semantic way to mark something as deprecated in the language itself. For example an attribute like [[deprecated]] in C++ can be used to indicate a function is to be retired.

This may or may not be harder to do for a keyword. In some languages keywords are a lot like functions or even exactly the same (e.g. Tcl).

My personal preference is both:

  1. a clear way to identify the dialect/version used.
  2. an upgrade tool
$\endgroup$
7
  • 2
    $\begingroup$ Along with the "don't bother" option should be "warn ahead of time". This is similar to deprecating features -- you warn about it during release N-1, then implement the breaking change in release N. $\endgroup$
    – Barmar
    Commented Jul 12, 2023 at 15:44
  • 1
    $\begingroup$ @Barmar: An important aspect of deprecating features is ensuring that anything that could be done using a deprecated feature can be done at least as well--if not better--without it. $\endgroup$
    – supercat
    Commented Jul 13, 2023 at 17:43
  • $\begingroup$ In this case, you just have to rename any conflicting variables. @supercat The warning gives you time to do that. $\endgroup$
    – Barmar
    Commented Jul 14, 2023 at 2:40
  • $\begingroup$ In this particular case, that would be true unless one needs to link with code outside one's control. But in the more general case, deprecation requires ensuring that replacements exist--a concept that maintainers of the C language ignore. $\endgroup$
    – supercat
    Commented Jul 14, 2023 at 4:50
  • 1
    $\begingroup$ @supercat Java deprecated sun.misc.Unsafe without providing a replacement or alternative. If they had provided a replacement then it would defeat the purpose of deprecating it, which was to remove something that was never supposed to be available to users in the first place. I think you are using the word "requires" to mean that this is what you think language maintainers should do, but AFAIK this is not part of the definition of the word "deprecate". $\endgroup$
    – kaya3
    Commented Jul 24, 2023 at 12:56
11
$\begingroup$

Stropping was the approach taken in some early programming languages like Algol-60 and Algol-W. The Wikipedia article says:

In computer language design, stropping is a method of explicitly marking letter sequences as having a special property, such as being a keyword, or a certain type of variable or storage location, and thus inhabiting a different namespace from ordinary names ("identifiers"), in order to avoid clashes. Stropping is not used in most modern languages – instead, keywords are reserved words and cannot be used as identifiers. Stropping allows the same letter sequence to be used both as a keyword and as an identifier, and simplifies parsing in that case – for example allowing a variable named if without clashing with the keyword if.

Algol-60 implementations frequently used single quotes around a keyword, e.g. 'BEGIN'. On some machines it used underlining. And in the language specification this was typically represented as bolding.

Algol 68 used several alternative stropping conventions that could be specified by a compiler directive, like .FOR, 'FOR, upper/lowercase... plus there was the option of not requiring stropping at all and just having reserved words like most modern languages. In publication, keywords were typically represented as underlined or bold.

FORTRAN has limited stropping for reserved words or operators like .EQ. and .AND.

Stropping went out of fashion as programming language portability became more and more important, since the least common denominator for composing programs was in plain ASCII text with little or no formatting. The usual mechanisms of displaying stropped keywords via fonts like bold could not be relied on to be portable.

Stropping avoids breaking existing programs if new keywords collide with identifiers in old code. Stropping avoids the contortions that languages like C++ need to provide for extensibility, avoiding such conflicts. Nearly all modern programming is done using IDEs or editors that can do syntax coloring. Syntax coloring is almost exactly what you need to do to make stropping user-friendly. You don't need to require the user to type the stropping syntax: the user can type in plaintext, and the editor/IDE can apply stropping automatically, and save the stropped format in a file. If new syntax is added with a new keyword foo that happened to be used as an identifier foo in old programs, no problem. There are several portable formats that can save such stropped program representations, if necessary with more metadata than simple .stropping. can represent.

However, stropping does not completely eliminate problems with extensibility, e.g. if you wish to design a language that allows new language constructs, syntax, and keywords to be added. If there are multiple such extension packages, they can conflict. Ultimately you will need some sort of namespace indication for such syntactic extensions. So you might want to go beyond simple .stropping. to representations that can store more metadata.
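To make the basic mechanism concrete, here is a minimal sketch of a lexer for quote-style stropping, where 'if' is a keyword and a bare if is an identifier. The regular expression and the `lex` function are my own toy construction, not taken from any Algol implementation.

```python
import re

# A quoted word is a keyword, a bare word is an identifier, and any other
# non-space character is passed through as its own token.
TOKEN_RE = re.compile(
    r"'(?P<keyword>[A-Za-z]+)'"
    r"|(?P<identifier>[A-Za-z_]\w*)"
    r"|(?P<other>\S)"
)


def lex(source):
    """Split source into (kind, text) pairs; keywords never clash with names."""
    tokens = []
    for match_obj in TOKEN_RE.finditer(source):
        kind = match_obj.lastgroup
        tokens.append((kind, match_obj.group(kind)))
    return tokens


tokens = lex("'if' if = 1 'then' then")
```

Since keywords and identifiers live in different lexical namespaces, adding a new stropped keyword cannot capture any existing identifier.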

$\endgroup$
7
  • 1
    $\begingroup$ A language could require that keywords be punctuated a certain way using ASCII characters in a way which may look ugly if printed in a single font without highlighting, but also specify a preferred visual representation which would look much more attractive (but be easily recognizable as mapping to the original ASCII construct). $\endgroup$
    – supercat
    Commented Jul 13, 2023 at 17:47
  • 1
    $\begingroup$ For example, it could specify that a Euclidean division operator be written as ` .ediv. ` in text (with white space to either side), but that editors should when practical show alphanumeric strings delimited by periods and whitespace in such fashion as italics with the periods replaced by micro-spaces, e.g. ediv. $\endgroup$
    – supercat
    Commented Jul 13, 2023 at 17:50
  • 2
    $\begingroup$ Huh. All this time, I thought all those old papers I read with Algol code in them were just nicely formatted to make them easy to read. Never realised that highlighting the keywords was a required feature of the language. :) $\endgroup$
    – occipita
    Commented Jul 15, 2023 at 15:00
  • $\begingroup$ @occipita: stropping was not a "required" feature of the language. As I mentioned, there were several different stropping conventions, and some compilers could support several, including the "no-stropping" version where keywords were simply written in whatever character set you had. However, I am 99.9% certain that various Algol 68 programs had some variable names that were the same as keywords. And I think I remember this also in either Algol 58 or Algol 60. There were various tools to convert from one stropping representation to another, which is trivial, except for keyword collision. $\endgroup$
    – Krazy Glew
    Commented Oct 13, 2023 at 5:09
  • $\begingroup$ Now something controversial: IMHO we should use the moral equivalent of syntax coloring or stropping to distinguish keywords from variable names, etc. It should be extensible. XML is a perfectly good exchange format. The low-level representation of your code might look like <if> if == then <then> then() </if>. The advantage of XML is that it can be extended in the future. A compiler might not understand a new keyword, but at least it can recognize the code that uses it and report a meaningful error, re-synchronizing at the end of the construct </if>, rather than going wildly crazy as so often happens with C++ code. $\endgroup$
    – Krazy Glew
    Commented Oct 13, 2023 at 5:18
9
$\begingroup$

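A quick and dirty approach that C uses is to reserve all identifiers that start with an underscore followed by an uppercase letter, introduce new keywords in that form (such as _Bool), and provide the friendlier spelling only as a convenience macro in a header file (such as bool in <stdbool.h>).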

$\endgroup$
9
$\begingroup$

Use namespaces everywhere

This could mean many things. You may choose what is appropriate in your language:

  1. A user-defined identifier could override the meaning of a keyword.
  2. You could use the original keyword by explicitly specifying the namespace, for example system::await.
  3. You could alias a keyword to a different name in a program.
  4. You could import the namespace only if you need the keyword in your program.

It might be easiest to implement only #4, in which case the namespace might not be called a namespace, but rather a feature option that you can turn on and off. But there would be problems if it exported identifiers that code using the new features would need. You could support aliases in the import, but that wouldn't work well for members of a structure.

There is an easy solution, though: a special grammar for forcing the use of an identifier that is spelled the same as a keyword. You could turn off the option in old source files, and when you want to upgrade a file, change the identifiers with conflicting names to the new grammar. C# uses @x. Some SQL dialects use `x` or [x]. Shell scripts (bash, say) use 'x' or any other quotes or escapes, and also command x, which additionally bypasses functions. PHP uses ${'x'} for names that are not in the format of an identifier, but it may not be the best example because variables and keywords are not in the same namespace in the first place.

$\endgroup$
1
  • 1
    $\begingroup$ I do like option #4: it's similar to @ATaco's version opt-in, but more fine-grained; in particular it avoids the issue that you can't just require the latest version in all of your code, because that could unnecessarily break backwards compatibility. $\endgroup$
    – G. Sliepen
    Commented Jul 12, 2023 at 14:07
7
$\begingroup$

Different lexical forms for keywords and identifiers

In my current design, identifier names must contain a capital letter, digit or underscore; anything that is purely lowercase letters is inherently a keyword. There is thus no risk of a variable name becoming a reserved word in a future version, because it can't be a valid variable name in the first place.
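A rough sketch of that rule, with the exact identifier pattern and the names `classify_word` and `KNOWN_KEYWORDS` being my guesses for illustration rather than the actual design:

```python
import re

LOWERCASE_WORD = re.compile(r"[a-z]+")
# Assumed identifier shape: letters, digits and underscores. Purely
# lowercase words are filtered out before this pattern is consulted.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

KNOWN_KEYWORDS = frozenset({"if", "else", "while"})  # hypothetical version 1 set


def classify_word(word, known_keywords=KNOWN_KEYWORDS):
    """Purely lowercase words are keyword-space; everything else is a name."""
    if LOWERCASE_WORD.fullmatch(word):
        # Never a valid identifier, even if no version defines it yet.
        return "keyword" if word in known_keywords else "unknown-keyword"
    if IDENTIFIER.fullmatch(word):
        return "identifier"
    raise ValueError(f"not a valid word: {word!r}")


results = [classify_word(w) for w in ("if", "await", "Await", "my_var", "x1")]
```

Here await is rejected as a name in version 1 already, so turning it into a real keyword later cannot break anything.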

$\endgroup$
8
  • $\begingroup$ I would do it the other way around, using all uppercase for keywords and otherwise reserved words. Uppercase is harder to type, but keywords should be very short, and you need to type more letters for non-keywords. It also matches the SQL convention (which is not enforced). Uppercase is also harder to read, which makes it better suited for something you already know than for something you have to read completely first. It is also easier to search inside the code without accidentally finding too many matches in the comments. $\endgroup$ Commented Jul 13, 2023 at 11:36
  • $\begingroup$ Good answer but under "namespace" i understand something different (like C++ namespace). $\endgroup$ Commented Jul 13, 2023 at 11:38
  • $\begingroup$ I feel like uppercase keywords would be too visually distracting due to their frequency. For me, all-uppercase should be reserved for broad-scope constant values. $\endgroup$ Commented Jul 13, 2023 at 19:10
  • 1
    $\begingroup$ This idea, user23013's, user16217248's and Krazy Glew's are all quite similar, really. Arguably, different implementations of the same underlying idea. $\endgroup$ Commented Jul 27, 2023 at 8:06
  • $\begingroup$ @KarlKnechtel: yes, all of these syntactic distinctions expressed in the characters that are typed by and visible to the user are similar. IMHO where it gets interesting is when there is hidden metadata with the constructs to solve many of the other problems with compatibility. Then you don't need to have arbitrary restrictions like "keywords must be lowercase" or "keywords must be uppercase". $\endgroup$
    – Krazy Glew
    Commented Oct 13, 2023 at 6:00
6
$\begingroup$

You can add syntax to tell the parser which language standard to use. Ada has that.

But, in most cases it is fine to warn the user upfront and use keywords that most people expect to be keywords anyway.

$\endgroup$
5
$\begingroup$

Python had exactly this issue (specifically for async/await), and ended up introducing those as 'soft keywords' (in Python 3.5 and 3.6; they became fully reserved in 3.7). Essentially, there's no special case for them in the lexer, so they are treated as identifiers like any other, but there are also grammatical rules that expect these specific identifiers in specific positions. Since the new grammatical constructs would've been illegal previously anyway, no collisions take place. The only (slight) downside to this approach is parsing performance, but the effect is minuscule.

$\endgroup$
5
$\begingroup$

I think that the possibility of adding new keywords reinforces the idea that they should either be a property of the base language ab initio, or should be identified in some way. As somebody else has mentioned, ALGOL mandated or at least permitted stropping, while in Modula-2 Wirth mandated that keywords were capitalised while at least preferring that non-keywords (i.e. imported from libraries etc.) were not.

I would suggest designing the base language such that a spurious non-keyword identifier can always be recognised as an error, and that pragmata are used to indicate e.g. storage class rather than keywords.

From there, I would suggest first providing an intermediate version of the language, where any existing usage of the new keyword is flagged as an error which either has to be fixed or blessed by a pragma indicating that it's been reviewed.

Putting keywords in their own namespace, or requiring that a program unit be imported to make them available, is one approach. But my own feeling is that they should be minimised, in order, apart from anything else, to reduce lexing problems caused by the demands of internationalisation.

$\endgroup$
5
$\begingroup$

Make sure those keywords can only appear in places where program-specific identifiers cannot. That way there can be an identifier that is spelled like a keyword without there being any ambiguity. This can be difficult in a C-style language where identifiers can appear in all kinds of places. But it is far more achievable in a COBOL-style language where every line begins with a keyword that says how the rest of the line is to be parsed. For example, when your syntax for calling a function looks like this:

FUNCTION async TAKING input TYPE string
   WRITE input
ENDFUNCTION

CALL FUNCTION async PASSING await

then you could in the next version allow this:

ASYNC FUNCTION async TAKING input TYPE string
   WRITE input
ENDFUNCTION

AWAIT FUNCTION async PASSING await

When every line must start with a keyword, then you can add as many more line-starter keywords as you want without there ever being a conflict.
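A small sketch of such a position-driven parser, covering only the two call forms from the example. The table `STATEMENT_FORMS` and the function `parse_line` are invented for illustration.

```python
# Each statement starter maps to the shape of the rest of the line:
# a string is a fixed keyword, None marks a slot for a user identifier.
STATEMENT_FORMS = {
    "CALL": ["FUNCTION", None, "PASSING", None],
    "AWAIT": ["FUNCTION", None, "PASSING", None],  # added in the next version
}


def parse_line(line):
    """Return (starter, identifiers) for one statement line."""
    words = line.split()
    if not words:
        raise SyntaxError("empty statement")
    starter, rest = words[0], words[1:]
    form = STATEMENT_FORMS.get(starter)
    if form is None:
        raise SyntaxError(f"unknown statement starter: {starter}")
    if len(rest) != len(form):
        raise SyntaxError(f"wrong number of words after {starter}")
    identifiers = []
    for expected, word in zip(form, rest):
        if expected is None:
            identifiers.append(word)  # any spelling is fine in a name slot
        elif word != expected:
            raise SyntaxError(f"expected {expected}, found {word}")
    return starter, identifiers


parsed = parse_line("AWAIT FUNCTION async PASSING await")
```

Because the position alone decides whether a word is a keyword or a name, async and await remain usable as identifiers after the new starters are added.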

$\endgroup$
1
  • 1
    $\begingroup$ This doesn't work for all changes, though, as some may require a new keyword to appear in a place where it could previously have been used. See above for Eric Lippert's example of "yield" for C#. And there's a reason nobody writes new programs in COBOL any more -- it is so ridiculously verbose that you just can't write moderately complex programs so that they're readably small. How anyone could stand using it on punched cards is beyond me.... $\endgroup$
    – occipita
    Commented Jul 15, 2023 at 15:11
3
$\begingroup$

Preprocessor magic

One approach that can be used if your language has a token stream to token stream preprocessor, is to utilize it in the keyword-or-identifier decision making process.

The basic idea is to change the lexer to always output keywords and identifiers as identifier tokens¹, leaving it up to the preprocessor to replace identifier tokens representing a keyword with a keyword token.² This makes it possible to have preprocessor directives that can affect the identifier-or-reserved-word decision. For example, you could have a preprocessor directive that turns on or off a set of keywords, or even an individual keyword. Doing it as a matched set of directives (akin to #if and #endif) can be especially useful for scoping these changes, especially for code that may be #included and does not know what state to return to.

The main advantage of this versus a language version compiler flag, is that you can write code that uses the new keyword for its new purpose in the same file, simply by wrapping the section that uses it with a preprocessor directive.

The only language I am aware of that does this is Verilog, which supports a `begin_keywords preprocessor-like directive, and a corresponding `end_keywords. The `begin_keywords directive takes an argument that indicates the standard revision that defines the keywords to support.
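Here is a toy version of the idea, assuming whitespace-separated tokens and made-up directive names `#begin_keywords` / `#end_keywords` (modelled on the Verilog ones, but this is not Verilog syntax). `KEYWORDS_BY_STANDARD` and `preprocess` are likewise invented for the sketch.

```python
# Hypothetical keyword sets per language standard.
KEYWORDS_BY_STANDARD = {
    "v1": frozenset({"if", "while"}),
    "v2": frozenset({"if", "while", "async", "await"}),
}


def preprocess(lines, default_standard="v1"):
    """Tag each word as keyword or identifier using a stack of keyword sets."""
    stack = [default_standard]
    tagged = []
    for line in lines:
        words = line.split()
        if words and words[0] == "#begin_keywords":
            if len(words) != 2 or words[1] not in KEYWORDS_BY_STANDARD:
                raise SyntaxError(f"bad directive: {line!r}")
            stack.append(words[1])
            continue
        if words and words[0] == "#end_keywords":
            if len(stack) == 1:
                raise SyntaxError("unmatched #end_keywords")
            stack.pop()  # restore whatever was active before the region
            continue
        active = KEYWORDS_BY_STANDARD[stack[-1]]
        for word in words:
            tagged.append(("keyword" if word in active else "identifier", word))
    return tagged


tagged = preprocess(["await x", "#begin_keywords v2", "await x", "#end_keywords", "await"])
```

The stack is what lets an included file open and close its own region without knowing which standard the including file was using.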


Footnotes:

1 Traditionally a lexer will output special tokens for keywords, making them trivial for the parser to identify. Having to look inside a token to decide the parse path is more complicated, and often not supported by parser generators if your language is using one of those.

2 Obviously, there are some variations possible, like always outputting all reserved words as their own tokens in the lexer and having the preprocessor conditionally convert them back to identifiers, etc.

$\endgroup$
2
$\begingroup$

Reserve unused keywords for future use

Most of the time a new keyword is added to a language after its initial release, the keyword is one which exists in other languages already. The obvious examples are async, await and yield, but e.g. const and let were added to Javascript, case and match were added to Python, and so on.

This means that with some foresight, it's possible to reserve some keywords before the associated features are added to your language; then when it comes to add those features, there is no backwards compatibility problem because the keywords you need have already been reserved from day one.

Examples:

  • In Rust, the keywords abstract, become, box, do, final, macro, override, priv, typeof, unsized, virtual and yield are currently reserved for future use.

  • In Java, the keyword _ is reserved for "possible future use in parameter declarations"; and const and goto are reserved to allow for better error messages in case C++ programmers use these by mistake, but it would also be possible for Java to use them for new features in the future.

  • Nim likewise has some unused keywords reserved for future use, but I didn't find a comprehensive list of which ones are currently unused.
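In implementation terms this is just a second word list that the compiler consults when checking identifiers. A minimal sketch, with both sets and the function `check_identifier` chosen arbitrarily for illustration:

```python
# Keywords that already have a meaning in the hypothetical version 1.
ACTIVE_KEYWORDS = frozenset({"if", "else", "fn", "return"})
# Words with no meaning yet, rejected as identifiers from day one.
RESERVED_FOR_FUTURE = frozenset({"async", "await", "yield"})


def check_identifier(name):
    """Return the name if it may be used as an identifier, else raise."""
    if name in ACTIVE_KEYWORDS:
        raise SyntaxError(f"'{name}' is a keyword")
    if name in RESERVED_FOR_FUTURE:
        raise SyntaxError(f"'{name}' is reserved for future use")
    return name


accepted = check_identifier("total")
try:
    check_identifier("await")
    rejected = False
except SyntaxError:
    rejected = True
```

Giving reserved words their own error message also tells users why a seemingly unused word is off limits.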

$\endgroup$
1
  • $\begingroup$ For Nim, I think the only unused keyword is interface. $\endgroup$
    – xigoi
    Commented Jul 27, 2023 at 15:23
