
Contextual keywords are tokens that are keywords in some contexts, and identifiers in others. Consider this example, which SE's highlighter gets wrong:

weak var weak: AnyObject?

The first weak is a keyword, but the second is a variable name. No escaping or delimiting is required, since weak is only a keyword when it appears before var, or before an identifier in a closure's capture list (i.e. the ubiquitous { [weak self] in ... }). Note that it need not appear immediately before var, e.g. weak private var. Compare this with a variable actually named var, which would need a delimiter:

weak var `var`: AnyObject?

Another example I see in many C-family OO languages is set:

void DoSomething(ISet<int> set);
int Foo { get; set; }

set only functions as a keyword in the latter case, yet the syntax highlighter picks up both.

This is typically a non-issue when semantic highlighting is available (which colorizes identifiers depending on what kind of thing they refer to). However, pure syntax highlighters frequently get this wrong. The issue isn't unique to Stack Exchange: see also GitHub and SwiftFiddle making the same mistake.

Why is this issue so common, and how can it be mitigated?

  • I think the characterisation of "pure syntax highlighters" here is misplaced: all of these cases are syntactically unambiguous. The problem arises from either purely lexical markup or just incorrect parsers.
    – Michael Homer
    Commented Jun 23, 2023 at 4:27

2 Answers


You identify contextual keywords by parsing the syntax of the language, and you get it wrong because you don't parse the syntax of the language. For these examples you don't need to know what kind of identifier something is, because only some of them are identifiers at all, and once parsed you know which those are. I don't agree with the characterisation of this as a problem with "pure syntax highlighters" versus semantic highlighters: it's (primarily) between syntax highlighters and highlighters that don't actually follow the syntax.

Most "syntax highlighters" are really just token-based lexical markup, however, in that all that they do is classify substrings into a number of buckets based on enumerations, regular expressions, and the like. Substrings that match a certain pattern get given the "keyword" or "string literal" colouring, in some priority order. What they're doing is more like lexical highlighting, sometimes with a little more complexity to handle common things like string escapes.

This has been how general-purpose editors handled highlighting for a long time, and it's good enough to be "good enough" a lot of the time, but you can usually construct cases where it's wrong even without contextual keywords. Dedicated IDEs for particular languages, like Visual Studio or Xcode, get dedicated parsers, but other systems don't. The right solution in most cases is to use a real parser for the actual grammar of the language, which will usually classify all the source spans correctly. Even a proper parser isn't always good enough: sometimes you really do need semantic information of some sort, as in C's (a)*b, which is a multiplication if a names a variable but a cast of a dereference if a names a type, so you can't even lex it correctly without that information.

Compared to these quick-cheap-and-easy regular-expression matchers, though, parsers are slow, expensive, and hard. Stack Exchange already rejects adding more languages to the highlighter in its current form because the resulting JavaScript blob would get too big, and it would be even larger if every language had a complete parser. For syntax highlighting where the source may be wrong, it's also important that the parser can recover from an error and attempt to highlight the rest of the text, so it's going to need to be a pretty complex parser, too.


The highlighting GitHub uses for Swift is here. You can see that it matches weak with just a regular expression: (?<!\.)\bunowned\((?:safe|unsafe)\)|(?<!\.)\b(?:weak|unowned)\b. You can see where the keywords are defined for the Stack Exchange highlighting here, and the associated logic here. This one makes a bit more of an effort, but still bundles all the keywords together undistinguished in one big array.

In both cases, these could resolve the specific cases you've identified with more complex matchers. The TextMate highlighting could use a negative lookbehind to check for var or let beforehand, for example. This would solve some of the issues, but not all of them; a zero-width lookahead for things like = or . probably cuts out a few more. None of these approaches is going to be perfect: even with these sorts of enhanced regular expressions, this level of analysis fundamentally can't handle a context-free grammar like this.
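As a sketch of that kind of patch-up (my guess at a plausible pattern, not something either highlighter actually ships), a lookbehind/lookahead pair can restrict weak to roughly the two positions where it is a keyword:

import Foundation

// Illustrative only: match weak as a keyword when it is not preceded by
// var/let, and is followed either by more declaration modifiers or var,
// or by an identifier that closes a capture-list entry. The modifier
// list here is deliberately incomplete.
let weakKeyword = try! NSRegularExpression(pattern:
    #"(?<!\bvar\s)(?<!\blet\s)\bweak\b(?=\s+(?:var|let|private|public|internal)\b|\s+\w+\s*[,\]])"#
)

let source = "weak var weak: AnyObject?"
let hits = weakKeyword.matches(in: source,
                               range: NSRange(source.startIndex..., in: source))
// Only the first weak (offset 0) matches: the second is preceded by "var "
// and followed by ":". Contrived inputs still fool it - lookaround simply
// can't track the surrounding grammar.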

For your C# example, the two sets require even deeper syntactic analysis to distinguish, and again the existing highlighters don't try to. get; set; is a common-enough pattern that a specific carve-out probably could be made, although catching all the other keyword uses of set is still pretty hard: in a snippet like ... } set, the set is a keyword only inside a property block, but the same text could appear anywhere else with set referring to an identifier.
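For instance, a narrow, hypothetical carve-out (not anything these highlighters actually contain) can pick up the literal get; set; spelling while leaving a parameter named set alone:

import Foundation

// Matches set only in the exact "get; set;" spelling. Reordered accessors
// ("set; get;"), access modifiers ("private set;"), and unusual whitespace
// all slip through - which is exactly why such carve-outs stay incomplete.
let accessorSet = try! NSRegularExpression(pattern: #"(?<=get;\s)set(?=\s*;)"#)

for text in ["int Foo { get; set; }", "void DoSomething(ISet<int> set);"] {
    let n = accessorSet.numberOfMatches(in: text,
                                        range: NSRange(text.startIndex..., in: text))
    print(text, "->", n)  // 1 for the property, 0 for the parameter
}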


I say "parser" rather than "grammar" here because most languages don't have a precise formal grammar: there's often a covering grammar, and then some additional processing that addresses some corner cases. If you're designing the language, you could make the job a little easier by sticking to a formal grammar, and so conceivably different highlighting engines could plug in your careful LALR grammar or something like that; in practice, this isn't how they mostly work at the moment, but in an ideal world it would help. This will still be large and likely even slower than a parser, but it is easily reusable, although in practice it's currently pretty useless for this particular purpose.

You could also make things even easier by not having any syntax-dependent elements at all: if every delimited token can only mean exactly one thing, then current regular-expression highlighters can do the job. If you want public highlighters to be able to handle your language, this is probably your best bet. It certainly means no contextual keywords, but it also rules out other constructs that can't be identified correctly that way.

If you have a language that already has these elements and you're trying to implement a highlighter for it specifically, the answer is to use a real parser for the actual syntax of the language and highlight based on the grammatical classes you assign to each source span. A pretty-printer driven by the parse tree can help here if you're generating fixed output, like the websites you've pointed at.
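As a toy illustration of that (nothing like a real Swift parser, just the shape of the idea), even a tiny hand-written parse assigns classes by grammatical position rather than by spelling:

enum TokenClass { case keyword, identifier, type, punctuation }

// Classify the tokens of one simplified declaration:
//   modifier* ("var" | "let") name (":" type)?
// A word is a keyword because of where it sits, not what it spells.
func classifyDeclaration(_ tokens: [String]) -> [(String, TokenClass)] {
    var out: [(String, TokenClass)] = []
    var i = 0

    // Leading declaration modifiers: keywords only in this position.
    let modifiers: Set<String> = ["weak", "unowned", "private", "public"]
    while i < tokens.count, modifiers.contains(tokens[i]) {
        out.append((tokens[i], .keyword)); i += 1
    }

    guard i < tokens.count, tokens[i] == "var" || tokens[i] == "let" else { return out }
    out.append((tokens[i], .keyword)); i += 1

    // Whatever follows the introducer is a name, even if it spells "weak".
    if i < tokens.count { out.append((tokens[i], .identifier)); i += 1 }
    if i + 1 < tokens.count, tokens[i] == ":" {
        out.append((":", .punctuation))
        out.append((tokens[i + 1], .type))
    }
    return out
}

// classifyDeclaration(["weak", "var", "weak", ":", "AnyObject?"])
// -> weak: keyword, var: keyword, weak: identifier, AnyObject?: type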

In practice, though, for existing languages and public highlighters, slight improvements to the regular expressions used for matching tokens are likely the best you can do. Eventually you may be able to push the corner cases so far out of the way that nobody actually encounters them, but the trade-off of ever-more-incomprehensible advanced regular expressions is a real one that eventually stops being worthwhile (and at the point of (?<!\.)\bunowned\((?:safe|unsafe)\)|(?<!\.)\b(?:weak|unowned)\b, maybe it already has). These highlighters are just making a best-effort guess at being right often enough to be useful.

  • You have a good point about handling errors gracefully. Even dedicated IDEs get this wrong: you could kill Xcode 13's syntax highlighter (I haven't tried Xcode 14) by putting a null byte in a block comment, and Xcode 14.0 crashed the whole IDE when it tried to syntax highlight some invalid regex literals.
    – Bbrk24
    Commented Jun 23, 2023 at 11:47

The only answer is to use the actual language parser for syntax highlighting, not some cut-down part of its lexer.

In my languages I do this for IDE integration and for code formatting in the literate output.

As they are parsed by PEG, I just add highlighting and pretty-printing hints to the grammar, and the parser collects a range for every highlighting hint. Ranges are released once parsing is finalised (as in, no backtracking can discard them), or if parsing failed completely (in which case the input is only partially parsed). The IDE then simply applies the ranges.
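A minimal sketch of that range-collection mechanism (my own reconstruction of the idea, not the actual system described) might look like this: each matched token buffers a hint, and backtracking rolls the buffer back so abandoned alternatives leave no stray colours.

enum Hint { case keyword, identifier }

struct HintedParser {
    let source: [Character]
    var pos = 0
    var ranges: [(Range<Int>, Hint)] = []  // buffered highlighting hints

    // Match a literal token and buffer a hint range for it.
    mutating func literal(_ text: String, hint: Hint) -> Bool {
        let chars = Array(text)
        guard pos + chars.count <= source.count,
              Array(source[pos ..< pos + chars.count]) == chars else { return false }
        ranges.append((pos ..< pos + chars.count, hint))
        pos += chars.count
        return true
    }

    // Try one alternative; on failure roll back position *and* hints, so
    // only ranges from finalised (non-backtrackable) parses survive.
    mutating func attempt(_ body: (inout HintedParser) -> Bool) -> Bool {
        let savedPos = pos
        let savedCount = ranges.count
        if body(&self) { return true }
        pos = savedPos
        ranges.removeLast(ranges.count - savedCount)
        return false
    }
}

var p = HintedParser(source: Array("weak var weak"))
_ = p.attempt { $0.literal("weak", hint: .keyword) && $0.literal("!", hint: .keyword) }
// Fails on "!": the buffered hint for "weak" is rolled back with it.
_ = p.attempt { $0.literal("weak", hint: .keyword) }
// Succeeds: p.ranges is now [(0..<4, .keyword)] and p.pos == 4.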

I am no fan of lexers in general, so I prefer to skip them altogether.

Now, since you're asking about pure syntax highlighters, I assume you mean those that can be embedded into web pages. They are typically just primitive regexp-based lexers, and as we know, we really need semantic highlighting. The approach I described above lets you take any point of the language state (e.g. if your language syntax is extensible) and extract a portable PEG implementation of the parser, along with the range highlighter and pretty-printer, which can then be packaged and embedded for client-side highlighting.

You won't get semantically rich things like type hints here, of course, but a pure parser can quite feasibly be extracted into a portable, compact, and fast implementation that is much more capable than a bunch of regexps and, most importantly, requires no manual work: it can be generated automatically.

