Stop guessing/auto-detecting a language when you KNOW it will be incorrect

Question

The problem: SE asks Highlight.js to autodetect the language when it knows there isn’t any optimal/correct choice for us to make - resulting in very poor outcomes.

Disclaimer: I say this as the current Highlight.js maintainer

Example: SE currently does not load our Groovy grammar. When one adds a Groovy block of code and hints it as ```groovy, SE will still ask Highlight.js to auto-detect the language, even though it knows the language is Groovy and has purposely chosen not to enable our Groovy grammar.

This results in poor and inconsistent highlighting for many snippets and encourages bad user behavior that will only make the situation worse long-term. Auto-detect is not intended to be used to find "next best" matches for built-in grammars purposely excluded from a build. This will frequently result in highlighting that appears entirely random (based on variable names that match keywords, etc.).

List of reasons the existing behaviour is bad:

- It makes users think a language is supported when it is not (this confusion is obvious in many threads following the switch to Highlight.js).
- It results in incorrect/poor highlighting here and now (since the correct grammar is not available).
- It results in seemingly random highlighting (different snippets of a single language end up highlighted with many different languages based on the exact content of the snippet).
  - Worse, this can encourage people to mis-hint or mistag posts consistently (i.e., always using java instead of groovy) just to get more consistent highlighting. This has already been mentioned/suggested in other threads (see Groovy discussion).
  - This mis-hinting/mistagging is not future-proof... if one day SO decides to add proper Groovy support, but older posts are tagged/hinted java (as a workaround)... those posts will not receive the new highlighting that would be possible if they had been hinted properly.
- It can encourage hinting snippets with none (to avoid terrible auto-formatting) or even picking a random language just to find something that looks "better".
  - This is also not future-proof in that if the missing language is ever added in the future the incorrect suboptimal hint will continue to be used indefinitely.
- It can encourage users to endlessly fiddle with their snippet just to see if they can "push" the highlighter towards a better choice.

What should happen instead:

If it's known that the language requested is not supported then one of several things should happen:

- No highlighting should be used, i.e. alias to none or plaintext. Unfortunate, but consistent.
- The next closest match should be hard-coded as an alias. You're already doing this for some languages, like your VBScript → VB.NET mapping.
  - This results in consistent behaviour (keywords will always be highlighted the same from snippet to snippet).
  - Users can learn the pros and cons of this behaviour (i.e., its quirks, etc.).
  - If/when additional language support is added in the future, the alias is removed and all existing posts that are correctly hinted are immediately "upgraded" with full and correct highlighting.
- Lazy-load individual grammars (if they're not part of the default bundle) via a CDN and then perform highlighting as normal.
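
The resolution order proposed here can be sketched roughly as follows. This is only an illustration in Python of the decision logic, not SE's or Highlight.js's actual code: `LOADED_GRAMMARS`, `ALIASES`, and `resolve_hint` are hypothetical names, and the grammar set is made up for the example. Lazy-loading is left out because it depends on how a site serves its assets.

```python
from typing import Optional

# Hypothetical data: grammars bundled by the site, plus hand-picked aliases.
LOADED_GRAMMARS = {"java", "javascript", "sql", "vbnet"}
ALIASES = {"vbscript": "vbnet"}  # mirrors the VBScript -> VB.NET mapping


def resolve_hint(hint: Optional[str]) -> str:
    """Return a grammar name, "none" for plain text, or "auto" to auto-detect."""
    if hint is None or not hint.strip():
        return "auto"  # no explicit hint: auto-detection is still reasonable
    name = hint.strip().lower()
    if name in LOADED_GRAMMARS:
        return name  # the requested grammar is available
    alias = ALIASES.get(name)
    if alias in LOADED_GRAMMARS:
        return alias  # consistent, hard-coded "next closest match"
    return "none"  # explicit but unsupported hint: never guess
```

With this table, `resolve_hint("groovy")` yields `"none"` today; if Groovy were later added to `LOADED_GRAMMARS`, every correctly hinted post would pick up the real grammar without any edits.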

In summary:

No highlighting should be preferred over random highlighting for hinted snippets where SE has purposely chosen not to load a grammar module. Lazy-loading of grammars or manual hinting of alternatives (i.e., "java is a reasonable approx. of groovy") are some other options.

Also: no formatting may be a better choice for all snippets that have an explicit hint that cannot be resolved to any known language, though that's likely a larger discussion.

This was prompted by the Groovy discussion among others: What happened to Groovy syntax highlighting?

A small auto-detect primer and why this is a "worst-case" scenario for auto-detect.

Highlight.js auto-detection is based on analyzing a code snippet with all available language grammars and scoring its relevancy with each. The highest score "wins". While the keyword class or a variable named $blah is somewhat relevant in indicating a given piece of code might be PHP - the tag <?php is highly relevant, as it only ever appears in PHP templates. We're looking for which language seems to be the most "relevant" for a particular code snippet.

Let’s say we're asked to auto-detect the language and we find (in a perfect world) relevance scores something like:

C++:    9
SQL:    10
Java:   11
Groovy: 102

The code in question registers as 10x more "relevant" for Groovy, so it is highly likely this is a Groovy snippet. So what happens if the Groovy grammar isn't loaded - if we have no idea what Groovy code even is? You often end up with scoring much more like:

C++:  10
SQL:  9
Java: 10
Dart: 8
Go:   11

Our code now poorly matches whatever is left (since the correct answer [of Groovy] is no longer possible). The exact relevance values will of course change (depending on the snippet of code) and may not be this dramatic - but without the correct grammar loaded it's far more likely there is no clear winner... making the final language auto-detected much more of a coin toss.

This isn't a perfect example, but hopefully it is illustrative.
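
To make the "clear winner vs. coin toss" point concrete, here is a small Python sketch that uses the illustrative numbers above. It does not reproduce how Highlight.js computes relevance; `pick_language` and the ratio it reports are assumptions introduced only for this illustration.

```python
from typing import Dict, Tuple


def pick_language(scores: Dict[str, int]) -> Tuple[str, float]:
    """Return the top-scoring language and its score ratio over the runner-up."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best, best_score = ranked[0]
    second_score = ranked[1][1]
    ratio = best_score / second_score if second_score else float("inf")
    return best, ratio


# The two illustrative score tables from the text above.
with_groovy = {"C++": 9, "SQL": 10, "Java": 11, "Groovy": 102}
without_groovy = {"C++": 10, "SQL": 9, "Java": 10, "Dart": 8, "Go": 11}

print(pick_language(with_groovy))     # Groovy wins by a wide margin
print(pick_language(without_groovy))  # Go "wins", barely ahead of C++ and Java
```

In the second case a single extra keyword match in the snippet could flip the result to Java or C++, which is why the outcome looks random from one snippet to the next.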

I think the most important thing here is to avoid the appearance of "random" behavior with the highlighting. When someone goes to the trouble to manually specify the correct language the last thing they should expect to see is entirely random highlighting.
– Josh Goebel
Commented Oct 27, 2020 at 12:33

That is also in part because people don't check which languages are supported, and hence assume that certain behaviour is supported whilst it isn't. Which then in the end leads to confusion.
– Luuklag
Commented Oct 27, 2020 at 12:36

SE's current behavior doesn't do anything to help with this confusion though. There is a reason our default behavior (as a library) is to fall back to NO highlighting when an incorrect or unknown language is provided. It provides feedback... makes it clear something is wrong, i.e. that the language "is not supported". Perhaps the user typed it wrong. We also log an error to the console, but SE could of course make a visual warning if they chose: groovy is not a supported language currently - [link to supported list]. They could even do this during composition.
– Josh Goebel
Commented Oct 27, 2020 at 12:41

That's a big reason we default to "no highlight"... it's a warning that's hard to miss - doesn't require reading - yet provides immediate feedback to the user. :-) And if someone was a regular SO user they'd very quickly learn what no highlighting signaled: an unsupported language or a typo in the language name. Inconsistent highlighting is a lot harder to "see" at a glance, making it far worse feedback. Perhaps even impossible without multiple examples.
– Josh Goebel
Commented Oct 27, 2020 at 13:01

I fully agree with the question, but I wonder what the automatic detection is then good for? It seems to be only good for cases where the content creator has no clue what language the code they just provided is written in. And then we would again get potential random behavior. Automatic detection can always lead to random behavior, or not? As for Groovy, StackOverflow could just enable the Groovy grammar. Does the highlight.js auto-detection get things wrong more often if the number of possible languages increases?
– NoDataDumpNoContribution
Commented Oct 27, 2020 at 14:29

Auto-detect allows post authors to use less effort - you don't need to think about tagging every snippet. This can often work very well when paired with post tag context... i.e. a post tagged "javascript" is far more likely to be javascript than say sql. (Though SE still has tons of room to improve here also.) I'm not suggesting we remove ALL auto-detection, only auto-detection where it's known in advance the outcome is likely to be poor (such as when a grammar is requested that SE has chosen not to load).
– Josh Goebel
Commented Oct 27, 2020 at 14:54

"As for Groovy, StackOverflow could just enable the Groovy grammar." I'm pretty sure it's a space/size concern, not a reliability concern. Every language makes their site slower to download... our library is almost 1mb if you include every language, yet ~50kb for a small popular set of languages. "Does the highlight.js auto-detection get things wrong more often if the number of possible languages increases?" That's certainly a possibility (more to choose from), though SE could mitigate it greatly with smarter usage of tags to clue the auto-detection.
– Josh Goebel
Commented Oct 27, 2020 at 14:56

"Automatic detection can always lead to random behavior..." No... the hope is you get predictable behavior based on relevant content (when there is enough signal in the snippet) - not random behavior. But when you remove the "right" answer (purposely don't load a grammar, etc.) and then make the highlighter choose from a bunch of sub-optimal answers [none of them good matches]... you're far more likely to get randomness since you have no idea what the correct signal even is.
– Josh Goebel
Commented Oct 27, 2020 at 15:06

It would also be possible to lazy-load a more complete set (or just a different set) of grammars based on either the specified language or the language specified as the default for the tag (currently, there's a very limited list of permitted tag default syntax highlighting languages, so that would need to change). Being able to load more languages, or even the entire language set, doesn't require that the entire language set be downloaded for every page. It's more complicated, but SE already has a mechanism in place in the JavaScript to lazy-load additional packages when needed/desired.
– Makyen
Commented Oct 27, 2020 at 17:22

Absolutely. Lazy-loading is the answer if the goal is to correctly highlight as many languages as possible as correctly as possible. Though SE has mentioned the monetary bandwidth costs as well, not just the bundle size from a time perspective. Creative client-side caching would help a lot there I think, because once you've downloaded a grammar a single time there’s no need to ever download it again. At least not until a new version of the library is used.
– Josh Goebel
Commented Oct 27, 2020 at 17:38

One way to approach this is by ending the practice of tags lacking a specified highlighting language triggering auto-detect... but that may do more harm than good, since there are plenty of tags where a language simply doesn't make sense, especially the ones denoting concepts rather than libraries or languages (e.g. array). I think you nailed it in your comment: the best-case scenario from a usage POV would be to lazy load it if the language specified/ detected wasn't already loaded.
– zcoop98
Commented Oct 27, 2020 at 22:06

@zcoop98 "if the language specified/ detected wasn't already loaded" This wording is a bit confusing. A language grammar must first be loaded before it can be auto-detected... so there is no such thing as "This looks like Groovy, so now lazy load Groovy". But if a post was hinted groovy then SE could choose to lazy load Groovy and use that explicitly. Or if a post was tagged groovy (among other things) then SE could lazy load Groovy before highlighting and then auto-detect would consider Groovy as a possibility when doing the analysis.
– Josh Goebel
Commented Oct 27, 2020 at 23:54

"ending the practice of tags lacking a specified highlighting language triggering auto-detect" As you say, that might be too radical. Really a list of all valid language tags is necessary... so that when given a tag the JS can query "is groovy a language tag or a generic concept tag?" And if it's a language (one that's simply not in the default bundle) then that would either key the lazy-load or simply turn off highlighting for that block.
– Josh Goebel
Commented Oct 28, 2020 at 0:01

Oh yeah! Wanted to plug this relevant user script by @LionelRowe that does implement lazy loading of highlight.js language libraries. It only works when the language is explicitly specified by a lang-X identifier (rather than a tag); however, it does succeed in highlighting languages currently unsupported by SE.
– zcoop98
Commented Oct 28, 2020 at 14:56

Adam Lear · Accepted Answer · 2021-04-01 19:52:42Z

11

This is a hard problem for us due to many years of tech debt and how most users expect the site to act. We have a few potential options here:

1. Support every language hljs supports. This is the ideal solution, but has some barriers due to our scale (network costs).
2. Turn off highlighting entirely unless a valid language is specified or a tag with a language set is added to the post. This essentially kills auto-highlighting entirely. This is the "most correct" solution, but it's far from ideal for reasons animuson mentioned.
3. Make all auto-highlighting fall back to a generic highlighter (prettify did this). This gets us part of the way, but is also far from ideal.
4. Make all auto-highlighting fall back to a generic highlighter if a more suitable language isn't found. This sounds like the most ideal solution, but what value do we use for the confidence threshold on the language detection? Is hljs consistent in their relevance scores? Is this something we can even do reliably? There are a lot of unknowns here.

We'd like to explore this further at some point, and I personally think option 4 is interesting, if we can identify and address some of the unknowns. Having said that, I don't anticipate that we'll be able to tackle this anytime soon, so I'm tagging this as status-deferred for the time being.
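
For discussion purposes, the fallback idea in the last option might look something like the Python sketch below. Everything in it is an assumption: `choose_highlighter`, the `min_score` and `min_ratio` thresholds, and the `"generic"` fallback name are hypothetical, and whether relevance scores are comparable enough across grammars for fixed thresholds to work is exactly the open question raised above.

```python
from typing import Dict


def choose_highlighter(
    scores: Dict[str, int],
    min_score: int = 20,       # hypothetical absolute floor for a "real" match
    min_ratio: float = 2.0,    # hypothetical required lead over the runner-up
    fallback: str = "generic",
) -> str:
    """Pick the detected language only when it is a clear winner."""
    if not scores:
        return fallback
    ranked = sorted(scores.values(), reverse=True)
    best_score = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else 0
    if best_score < min_score:
        return fallback  # nothing matched strongly
    if runner_up > 0 and best_score / runner_up < min_ratio:
        return fallback  # too close to call
    return max(scores, key=scores.get)
```

Fed the illustrative scores from the question, this would return the detected language when one grammar dominates and `"generic"` when the top candidates are nearly tied.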

answered Apr 1, 2021 at 19:52 by Adam Lear (Staff, Mod)

9

If I read Josh Goebel's post correctly he suggests option 2b: if an unsupported language hint is specified, then set highlighting to none. In all other cases (for example no language hint, or tags without a default), do autodetection. This should be fairly easy to implement, and it would improve the results (no highlighting is better than wrong highlighting). Maybe people would stop adding the language hint when they find out it defaults to none, so they should be reminded somehow that they should continue, but it would surely be an improvement over the current implementation.
– Marijn
Commented Apr 1, 2021 at 21:03
1

Would the network costs for point 1 be lower when we don't serve everyone the entire library of languages, but subdivide them into several groups, with smallest size n=1 (an optimum can be at n=X). So that we serve smaller libraries, but possibly more often?
– Luuklag
Commented Apr 1, 2021 at 21:42
7

@Marijn Yes, you understood me correctly. The problem is someone says "This is Pascal code" and SE says "here, I will give you some random language (not Pascal)" with NO explanation WHY and without showing any error telling the user "hey, pascal is not a valid language"... it entirely subverts users' expectations when providing a MANUAL language hint. The manual language hint should either succeed or fail, not result in entirely RANDOM behavior. Adam: What would the harm be in correcting this one element?
– Josh Goebel
Commented Apr 2, 2021 at 22:17
3

Is the problem a huge history of existing posts with ridiculous/incorrect manual hinting of languages? I can't think of any other good reason not to implement that behavior. It would be like going to a diner and ordering a burger and then getting 5 milkshakes with no explanation. If there aren't burgers available, they say so - or serve nothing at all. Don't give people milkshakes they don't want.
– Josh Goebel
Commented Apr 2, 2021 at 22:18
5

I don’t think point 2 is cogent, and I don’t see a cogent argument against this from animuson either: if a code snippet is tagged with a language that isn’t supported by Stack Overflow’s highlighter, then it shouldn’t be highlighted at all. I honestly find this pretty obvious. Prettify’s behaviour of applying random “makeup” was never justified: it completely misses the point of syntax highlighting, which, after all, isn’t to make code more colourful, but to make it more readable by highlighting semantically meaningful tokens. Prettify performed cargo cult highlighting.
– Konrad Rudolph
Commented May 17, 2021 at 22:21
3

Would a server side syntax highlighter be considered? Leave the dynamic highlighter for the preview/editor.
– Braiam
Commented Jul 15, 2022 at 18:02
1

I've started giving up and am explicitly tagging unsupported languages as lang-none because the current behavior is so broken. It's a shame because now these code blocks are incorrectly tagged.
– mbauman
Commented Feb 23, 2023 at 21:22


animuson · Accepted Answer · 2021-02-15 18:19:05Z

4

+50

So, you are discussing a few very different things in this post, and you have some false assumptions in there.

On automatic detection

Completely disabling automatic language detection in Highlight.js is off the table. Auto-detection may be detrimental in the singular case that you have provided, but that is not true for many other, much more popular languages.

The most common case is the combination of JavaScript, HTML, and CSS. Because these languages are so frequently mixed together in one question, we do not attempt to tell Highlight.js which language a code block might be, always preferring "default" for those tags. It is up to the highlighter to determine what type of code is in those blocks in a lot of cases, and simply leaving them as plain-text would definitely not be preferable there.

It doesn't sound like that's what you're really asking for here, though, despite some implication that it might be the catch-all solution.

On individual cases

Even if a language identifier is not explicitly aliased in the code, it is still possible to have a tag use another language by default. Any diamond moderator can change the default language for a tag to anything available - it is not hard-coded anywhere and does not need to match anything. If there is a better language that would serve as a default for a tag than "default" then raise the request on the per-site meta to have it set to that.

Tags can even be set to the "<none>" option if no syntax highlighting should ever be used for code blocks under that tag unless explicitly overridden. If you believe Groovy questions should by default not be highlighted at all over having faulty highlighting, then again that is a request that can be made on the per-site meta.

So given that, I'm not sure what there really is to do here. We would not turn it off completely because that would break detection for other tags and we already provide the tools to either set it to another similar language or none at all. You just have to ask for the tool to be used. Has anyone posted on Meta Stack Overflow for this case requesting the language hint for Groovy be changed to none?

answered Feb 15, 2021 at 18:19 by animuson (Staff, Mod)

2

@Luuklag Which is why I asked if the issue has been appropriately brought up on Meta Stack Overflow yet. But there doesn't seem to be anything that needs implemented here.
– animuson StaffMod
Commented Feb 15, 2021 at 19:48
1

This change certainly should not be applied to all tags. It should only be done on a case by case basis where it makes sense to do so.
– animuson StaffMod
Commented Feb 15, 2021 at 19:51
2

There are literally thousands of tags that are subsets of other languages or generally apply to another language, but users do not correctly add a tag that suggests the correct formatting. Setting everything always to none unless changed would result in a lot of code that could very easily be highlighted correctly being plaintext. That is not a better solution. Flip it around: we've rarely ever had a complaint about highlighting being so wrong for a language that it should be disabled completely for that tag. So why would we jump the gun and just blanket apply it to all for one complaint?
– animuson StaffMod
Commented Feb 15, 2021 at 19:55
2

As an aside, for context, auto detection was off when Highlight.js was first pushed out here, and we immediately got a bunch of complaints about code blocks not being highlighted that should be. So it was enabled (although it being disabled was not an intended change, users certainly let us know it was broken quickly).
– animuson StaffMod
Commented Feb 15, 2021 at 19:56
1

I don't know if things have been changed since I last looked (IIRC, earlier this year), but the list of languages to which each tag can be set is restricted to a limited subset of the languages which are supported by the SE syntax highlighting. At the time I checked, the available list was not all-inclusive of the supported languages and even if an AJAX request was crafted and sent specifying a language stated as supported, but not in the drop-down list, then the restriction was applied by the backend (i.e. AJAX request resulted in the value not set).
– Makyen
Commented Feb 15, 2021 at 20:50
1

@Makyen It is still a subset, but glancing over it the only things I see missing as options in the dropdown are things like HTTP Headers, Ini and TOML, JSON, and Makefiles. All of the languages seem to be there, even newer ones like Kotlin. It's also... not hard for a developer to just update that list if some of those that aren't listed need to be.
– animuson StaffMod
Commented Feb 15, 2021 at 21:05
2

It would be helpful if the missing ones were added. The ones you listed reminded me of which tag I'd wanted to specify a supported, but unavailable, language for: the json tag. I'd expect that we'd want to be able to set makefile, and others, too.
– Makyen
Commented Feb 15, 2021 at 21:21
2

BTW: is there an official list of what languages are selected from when default is set? From the performance I've seen, it appears to be a subset of all supported languages, rather than auto-selecting from all of the supported languages.
– Makyen
Commented Feb 15, 2021 at 21:24
2

This is all pretty flexible, but as Josh describes it falls apart when languages aren't loaded - and that's not just stuff like Groovy (which are in the core repo), there are a pile of satellite repos for other languages as well - so far as I can tell, the best solution for any of those is to explicitly mark the tag as "none" so as to disable highlighting on questions so-tagged, and then encourage folks to specify the language code or tag name when writing, making them at least somewhat future-proof.
– Shog9
Commented Feb 15, 2021 at 23:01
8

@animuson My very first comment on my post answers you: "I think the most important thing here is to avoid the appearance of "random" behavior with the highlighting. When someone goes to the trouble to manually specify the correct language the last thing they should expect to see is entirely random highlighting." If SE is purposely not going to include some languages we fully support (like Groovy) and yet knows (via tags or manual hinting) that a snippet is Groovy, it should NOT just ask us to randomly detect the language - because at that point there are only wrong answers.
– Josh Goebel
Commented Feb 16, 2021 at 8:14
1

To be clear, when I say "posting Groovy code" I should say "explicitly"... as in they are using a code fence (triple backtick) and have explicitly labeled it as a groovy block... if groovy isn't found in the loaded list of SE highlight languages, then no highlighting should be preferred (vs. random and inconsistent highlighting).
– Josh Goebel
Commented Feb 16, 2021 at 8:30
2

@JoshGoebel And we have an existing tool to set that tag to no highlighting if it is requested. This problem is very easily solved by just posting a request on Meta Stack Overflow asking for the language hint for the [groovy] tag to be changed to none and explaining why there.
– animuson StaffMod
Commented Feb 16, 2021 at 15:22
3

I know you mentioned accidentally turning off highlighting and users preferring it on but still using your idea of default (which is truly just random for all non-supported languages) still feels entirely backwards. It breaks expectations. I say groovy, but what I get is random. You should instead perhaps create a default grammar that highlights common keywords (if, else, etc) consistently. That way behavior is predictable, vs random. Again, I'm talking about explicitly labeled blocks.
– Josh Goebel
Commented Feb 17, 2021 at 12:21
6

I feel like the whole larger point of my message is being missed. "When a language isn't available we fall back to default." There is no default. Default is an illusion. SE should be honest and just call it "random", not default. Perhaps many users do prefer random highlighting... but it seems hard to comprehend.
– Josh Goebel
Commented Feb 17, 2021 at 12:24
4

@animuson We now consider your behavior in this regard entirely broken and we have created a primary issue that will track this, which we will continue to refer people to (closing opened issues on our end as duplicates) until this is resolved: github.com/highlightjs/highlight.js/issues/3183 We'll also refer people here and tell them to contact Stack Overflow support to complain, since we can't do anything on our end to fix this.
– Josh Goebel
Commented May 8, 2021 at 8:18


Tagged: feature-request, status-deferred, code-formatting, syntax-highlighting, language-hints