Contextual keywords are identified by parsing the syntax of the language, and highlighters get them wrong because they don't parse the syntax of the language. For the examples you give, you don't need to know what kind of identifier something is: only some of those tokens are identifiers at all, and once the text is parsed you know which ones. So I don't agree with characterising this as a problem with "pure syntax highlighters" versus semantic highlighters: the distinction is (primarily) between syntax highlighters and highlighters that don't actually follow the syntax.
Most "syntax highlighters" are really just token-based lexical markup, however, in that all that they do is classify substrings into a number of buckets based on enumerations, regular expressions, and the like. Substrings that match a certain pattern get given the "keyword" or "string literal" colouring, in some priority order. What they're doing is more like lexical highlighting, sometimes with a little more complexity to handle common things like string escapes.
This has been how general-purpose editors have handled highlighting for a long time, and it's "good enough" a lot of the time, but you can usually construct cases where it's wrong even without contextual keywords. Dedicated IDEs for particular languages, like Visual Studio or Xcode, get dedicated parsers, but other systems don't. The right solution in most cases is to use a real parser for the actual grammar of the language, which will usually classify all the source spans correctly. (Even a proper parser isn't always good enough: sometimes you really do need semantic information of some sort. In C, `(a)*b` can't even be lexed correctly without that information, because it's a cast of `*b` if `a` names a type but a multiplication if `a` is a variable.)
Compared to these quick, cheap, and easy regular-expression matchers, though, parsers are slow, expensive, and hard. Stack Exchange already rejects adding more languages to the highlighter in its current form because the resulting JavaScript blob would get too big, and it would be even larger if every language had a complete parser. And since the source being highlighted may be wrong, the parser has to be able to recover from an error and attempt to highlight the rest of the text, so it's going to need to be a pretty complex parser, too.
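As a hedged sketch of what recovery means at even the lowest level (the function and its shape are invented for illustration): an unterminated string literal shouldn't swallow the rest of the file, so the scanner gives up on it at the end of the line and carries on.

```typescript
// Recovering from an unterminated string literal: rather than letting the
// string run to the end of the input, close it at the newline and resume
// normal highlighting from there.
function scanString(source: string, start: number): { end: number; recovered: boolean } {
  let i = start + 1;                                  // skip the opening quote
  while (i < source.length) {
    if (source[i] === "\\") { i += 2; continue; }     // skip an escaped character
    if (source[i] === '"') return { end: i + 1, recovered: false };
    if (source[i] === "\n") return { end: i, recovered: true };  // recover here
    i++;
  }
  return { end: i, recovered: true };                 // recover at end of input
}
```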
The highlighting GitHub uses for Swift is here. You can see that it matches `weak` with just a regular expression: `(?<!\.)\bunowned\((?:safe|unsafe)\)|(?<!\.)\b(?:weak|unowned)\b`. You can see where the keywords are defined for the Stack Exchange highlighting here, and the associated logic here. This one makes a bit more of an effort, but still bundles all the keywords together, undistinguished, in one big array.
In both cases, the specific cases you've identified could be resolved with more complex matchers. The TextMate highlighting could use a negative lookbehind to check for `var` or `let` beforehand, for example. That would solve some of the issues, but not all of them; a zero-width lookahead for things like `=` or `.` probably cuts out a few more. None of these approaches is going to be perfect, because even with these sorts of enhanced regular expressions, this level of analysis fundamentally can't handle a context-free grammar like this.
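To make that concrete (this is my own illustrative pattern, not one from either highlighter; note that the variable-length lookbehind is a JavaScript extension that most other regex engines reject):

```typescript
// Only treat `weak` as a keyword when it isn't being declared or used as an
// identifier: reject `var weak`/`let weak` before it, and `weak =`/`weak.`
// after it.
const weakKeyword = /(?<!\b(?:var|let)\s+)\bweak\b(?!\s*[=.])/;

weakKeyword.test("weak var delegate: Delegate?");  // true: genuine keyword
weakKeyword.test("let weak = 1");                  // false: declaration name
weakKeyword.test("weak.someMethod()");             // false: identifier use
```

Even this fails on, say, `weak` used as a bare expression or as an argument; each added assertion narrows the failure cases without ever eliminating them.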
For your C# example, the two `set`s require even deeper syntactic analysis to distinguish, and again the existing highlighters don't try. `get; set;` is a common enough pattern that a specific carve-out probably could be made, although catching all the other keyword uses of `set` is still pretty hard: in `... } set`, `set` is a keyword only inside a property block, but the same text could appear anywhere else referring to the identifier.
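A carve-out along those lines might look like this (again an invented heuristic, not what any real highlighter does): treat `get`/`set` as accessor keywords only when followed by the punctuation that accessors take.

```typescript
// Heuristic carve-out: `get`/`set` counts as an accessor keyword only when
// followed by `;`, `{`, or `=>`, which is how accessors appear in a property.
const accessor = /\b(?:get|set)\b(?=\s*(?:;|\{|=>))/g;

"public int X { get; set; }".match(accessor);        // ["get", "set"]
"var set = MakeSet(); set.Add(1);".match(accessor);  // null: plain identifier
"items.Select(set => set.Count)".match(accessor);    // ["set"]: false positive
```

The last line shows why this stays hard: a lambda whose parameter happens to be named `set` matches the same local pattern as an accessor, and only real parsing can tell them apart.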
I say "parser" rather than "grammar" here because most languages don't have a precise formal grammar: there's often a covering grammar, and then some additional processing that addresses some corner cases. If you're designing the language, you could make the job a little easier by sticking to a formal grammar, and so conceivably different highlighting engines could plug in your careful LALR grammar or something like that; in practice, this isn't how they mostly work at the moment, but in an ideal world it would help. This will still be large and likely even slower than a parser, but it is easily reusable, although in practice it's currently pretty useless for this particular purpose.
You could also make things even easier by not having any syntax-dependent elements at all: if every delimited token can only mean exactly one thing, then the current regular-expression highlighters can do the job. If you want public highlighters to be able to handle your language, this is probably your best bet. It certainly means no contextual keywords, but it also rules out other constructs that can't be identified that way. (Sigils are one approach: in PHP, `$name` is always a variable, so a single regular expression classifies every variable reference correctly.)
If you have a language that already has these elements and you're trying to implement a highlighter specifically for it, the answer is to use a real parser for the actual syntax of the language and highlight based on the grammatical class you assign to each source span. A pretty-printer driven from the parse tree can be helpful here if you're generating fixed output, like the websites you've pointed at.
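In outline, that looks something like this (the `parse` function and node shapes are hypothetical stand-ins for a real parser; the point is that each span's class falls out of what the node *is*, not what its text looks like):

```typescript
// Highlighting driven by a real parse tree: classification comes from the
// kind of node the parser produced, never from pattern-matching raw text.
type Span = { start: number; end: number };
type Node =
  | ({ kind: "keyword" | "identifier" | "string" } & Span)
  | { kind: "propertyAccessor"; keyword: Span; body: Node[] };

declare function parse(source: string): Node[];  // hypothetical real parser

function highlightSpans(source: string): (Span & { cssClass: string })[] {
  const spans: (Span & { cssClass: string })[] = [];
  const visit = (node: Node): void => {
    if (node.kind === "propertyAccessor") {
      // `set` is a keyword here *because the parser says so*, not because of
      // a pattern matched against the surrounding text.
      spans.push({ ...node.keyword, cssClass: "keyword" });
      node.body.forEach(visit);
    } else {
      spans.push({ start: node.start, end: node.end, cssClass: node.kind });
    }
  };
  parse(source).forEach(visit);
  return spans;
}
```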
In practice, though, for existing languages and public highlighters, slight improvements to the regular expressions used for matching tokens are likely the best you can do. Eventually you may be able to push the corner cases so far out of the way that nobody actually encounters them, but ever-more-incomprehensible regular expressions are a real trade-off that eventually stops being worthwhile (and at the point of `(?<!\.)\bunowned\((?:safe|unsafe)\)|(?<!\.)\b(?:weak|unowned)\b`, maybe it already has). These highlighters are just making a best-effort guess at being right often enough to be useful.