Improving syntax highlighting language auto-detection

Question

highlight.js's current auto-detection is arguably poor, largely because of the number of available languages it has to pick from when not given a language hint.

I suggest that, when the language to use is ambiguous, Stack Exchange pass highlight.js the languages associated with the question's tags, instead of having it choose from every single language loaded on Stack Exchange. For example, a question tagged with javascript and css should call highlight.js with ['javascript', 'css'] as language hints, rather than with no hint; when no hint is given, the resulting highlighting is frequently wrong. This can easily be done by calling highlight.js's current API slightly differently.

Examples of current problems (see end of post for many more):

https://codereview.stackexchange.com/revisions/249795/1 On a question tagged with javascript algorithm object-oriented dictionary, code blocks in answers are auto-formatted as csharp, markdown, ini, and kotlin, all of which are completely unrelated.
https://meta.stackoverflow.com/q/401573 On a question tagged with html jquery, a code block is auto-formatted as lua, which is completely unrelated.
https://meta.stackexchange.com/a/354695 Currently, we're trying to usually have at most one tag on a question which is linked to a specific highlight language. Otherwise, if 2 or more of a question's tags are associated with a language, the choice is ambiguous, and highlight.js is called to highlight a code block with no language hint at all, which frequently produces inaccurate results.

The logic currently being used is:

Some tags are associated with highlight languages. These associations can be seen at the bottom of the tag wiki page, eg on SO, javascript is associated with lang-js:

Code Language (used for syntax highlighting): lang-js
If a question has exactly one tag with an associated highlight language, all code blocks in the post get highlighted with that language.
If 2+ tags are associated with a language, all code blocks in the post are highlighted by having highlight.js guess the most appropriate language from all available languages (not just the languages associated with the question's tags, but every language SE has loaded), which doesn't work well.
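The rules above can be sketched as a small function. This is only an illustration of the described behaviour, not SE's actual server code: the name currentCodeblockLang and the TAG_LANGUAGES table are hypothetical, and returning default when no tag has an associated language is an assumption.

```javascript
// Hypothetical tag -> highlight language table. The real associations
// are defined in each site's tag wikis; only these two appear in this post.
const TAG_LANGUAGES = new Map([
    ['javascript', 'lang-js'],
    ['xml', 'lang-xml'],
]);

// Sketch of the current logic: a language is used only when exactly one
// of the question's tags has an associated highlight language.
function currentCodeblockLang(questionTags, tagLanguages) {
    const associated = questionTags.filter((tag) => tagLanguages.has(tag));
    if (associated.length === 1) {
        return tagLanguages.get(associated[0]);
    }
    // 2+ associated tags are ambiguous, so no hint is passed along.
    // Assumption: zero associated tags are treated the same way.
    return 'default';
}
```

Under these assumptions, a question tagged javascript algorithm yields lang-js, while javascript xml falls through to default, which is the case this post is about.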

My suggestion: Highlighting would be much more flexible and accurate if, in the case of 2+ associated tags, highlight.js were called with those tags' languages as hints, rather than with no hint at all. This not only improves the appearance of questions with multiple tags, it also allows default languages to be associated with more tags. (We're currently trying to avoid having 2 or more tags associated with a highlight language on a question, which causes problems.) On SO, it's not that uncommon to see a question tagged with a subtag but not with the language's main tag, e.g. angular but not JavaScript, resulting in bad highlighting.

Technical front-end details on how this can be accomplished with a couple simple tweaks:

When a SE page is generated, a #js-codeblock-lang element is populated with the highlight language to use, if there is exactly one tag on the question associated with a language. Eg a question with javascript gets lang-js. A question with javascript xml gets default because both javascript and xml are associated with a language.

Through SE's JS, the content of this element is added to the class list of every code block in the post. For example, <pre class="lang-xml s-code-block"> or <pre class="default s-code-block">.

When it comes time to style code blocks, SE runs:

StackExchange.using("highlightjs", function () {
    $("pre.s-code-block:not(.hljs)").each(function () {
        StackExchange.highlightjs.instance.highlightBlock(this);
    });
});

where highlightBlock is the highlight.js function which highlights a code block. If the block has a language in its class attribute, that language will be used. If the class is default, highlight.js will guess the most appropriate language from all of the tens of languages that are loaded. (This is the problem.)
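To make the class-based selection concrete, here is a simplified model of how a language hint might be read from a block's class attribute. It is not highlight.js's actual implementation; the function name languageFromClass is hypothetical, and it only recognises the lang- prefix used in the examples above.

```javascript
// Simplified sketch: extract the hinted language from a class attribute
// such as "lang-xml s-code-block". Returns null when there is no hint.
function languageFromClass(className) {
    for (const cls of className.split(/\s+/)) {
        if (cls.startsWith('lang-')) {
            return cls.slice('lang-'.length);
        }
    }
    // e.g. "default s-code-block": nothing to go on, so the highlighter
    // would have to auto-detect among every loaded language.
    return null;
}
```

In this model, "lang-xml s-code-block" gives xml, and "default s-code-block" gives null, which corresponds to the unhinted auto-detection path.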

Edit: Below was my original suggestion of a way automatic syntax highlighting could be accomplished better, but one of the highlighter's maintainers, Josh, has a better suggestion in his answer.

We can force highlight.js to choose the most appropriate language from a short list of languages by using highlightAuto instead of highlightBlock. Unlike highlightBlock, highlightAuto can accept a parameter of languages to choose from. For example, passing ['xml', 'js'] will ensure that the resulting code is either highlighted as xml or js (and not something completely unrelated like lua). highlightAuto also returns an object containing the new HTML markup, instead of modifying a passed DOM node.

As a proof of concept, for a test run of my suggestion, I replaced Stack Exchange's code block above with the following code (hidden in the snippet) and looked at a bunch of questions (which were originally highlighted incorrectly) to see how well auto-detection would perform given a small number of languages to choose between:

// I'm using a Stack Snippet here to hide a long code block by default
throw new Error('This is not runnable here');

// The following code is just an example of how one might use highlightAuto:

StackExchange.using("highlightjs", function () {
    // This example uses the below object instead of the server-sent language
    const langsByTag = {
        javascript: 'js',
        java: 'java',
        python: 'python',
        'c#': 'csharp',
        php: 'php',
        html: 'xml',
        jquery: 'js',
        // CSS auto-highlighting is broken for some reason
        // (a completely separate issue), but SCSS works well
        css: 'scss',
        typescript: 'ts',
    };
    const thisQuestionTags = [...$('.question .post-tag')].map(a => a.textContent);
    const langs = [...new Set(thisQuestionTags.map(tag => langsByTag[tag]))].filter(Boolean);
    $("pre.s-code-block:not(.hljs)").each(function () {
        const code = this.children[0];
        const codeText = code.textContent;
        const doHighlight = (result) => {
            code.innerHTML = result.value;
            // Clearly expose the detected highlighted language by putting it into the DOM:
            this.dataset.highlightLang = result.language;
        };
        const doHighlightWithoutLanguageHints = () => {
            doHighlight(StackExchange.highlightjs.instance.highlightAuto(codeText));
        };
        
        if (!langs.length) {
            doHighlightWithoutLanguageHints();
            return;
        }
        // Auto-detect language, but only permit a language from one of the tags on the question:
        const highlightResult = StackExchange.highlightjs.instance.highlightAuto(codeText, langs);
        if (highlightResult.relevance >= 3) {
            // Result relevance isn't horrible, use it:
            doHighlight(highlightResult);
        } else { 
            // Otherwise, result relevance is unexpectedly low; perhaps question is mistagged,
            // or the language or the code block does not have enough language-specific syntax
            // Auto-detect language from all loaded languages.
            // Might well be inaccurate, but it may be better than the prior result:
            doHighlightWithoutLanguageHints();
        }
    });
});

Here's a small sample of questions which used to be highlighted badly, but are now highlighted correctly, using the above code:

https://stackoverflow.com/q/63030994 An excellent example. In a question with java html css, all the code blocks used to be highlighted as Java. Now, the two HTML blocks are properly highlighted as XML, the CSS block is highlighted as CSS, and the two config blocks are highlighted as INI.
https://stackoverflow.com/q/64129300 TypeScript can now be highlighted properly (screenshot: before/after)
https://stackoverflow.com/q/55064068 In a question with javascript css, the CSS code block is now highlighted properly as CSS, not JS
https://stackoverflow.com/q/64093029 In a question with html jquery, the HTML code block is now highlighted properly as XML, not Lua
https://stackoverflow.com/q/61985511 In a question with php html, the HTML code block is now highlighted properly as XML, not PHP
https://stackoverflow.com/q/56120519 In a question tagged with javascript c#, the first code block is now highlighted properly as JavaScript, not Less
https://stackoverflow.com/q/61287492 In a question tagged with python html, the two HTML code blocks are now highlighted properly as XML, not Python
https://stackoverflow.com/q/53122772 In a question tagged with javascript java, the following code blocks are corrected properly: Kotlin -> Java, XML -> JavaScript, C# -> Java

And so on. These are easy to find. It still isn't perfect, but I think this would be a solid improvement over the current logic, and it only requires a small change in Stack Exchange's code. Pass all languages on a question's tags to #js-codeblock-lang, then call highlightAuto instead of highlightBlock.

This may be a nooby question, but why can't we choose which highlighter we want for the question? Sometimes highlight.js is better, and sometimes Prettify is better. Why can't we just choose which one instead of having SO choose it for us? — 10 Rep, Commented Sep 30, 2020 at 16:55
Highlighters require a significant amount of code. Allowing both would somewhat increase overall bundle size (which is an issue for those with bad connections). It'd also add another layer of programming (adding toggling options, and adding / removing the different highlighters' formatting dynamically) and would introduce potential confusion to the already-complicated process of how to ask and answer good questions. IMO, using a single well-maintained library instead makes things much simpler. — CertainPerformance, Commented Sep 30, 2020 at 17:04
Aside from language detection, I think a lot of the issues we're having with highlight.js are simply manifestations of users being resistant to change, as most of us are. I'm hopeful it'll grow on us. — CertainPerformance, Commented Sep 30, 2020 at 17:05
I'm happy to add this smarter support to my Chrome extension if someone wants to figure out the FULL tag -> hint table for me (ie, the list of which tags auto-hint to which languages)... it's probably not that hard (a script snippet could probably do it with fetch in the browser console, etc?). github.com/joshgoebel/se_highlightjs/issues/3 — Josh Goebel, Commented Oct 29, 2020 at 12:15
result relevance is unexpectedly low; perhaps question is mistagged Ah I just looked at what you are doing more closely... that's interesting. I was imagining that the tags auto-hints (xml, java, html in your first example) would simply be used as a boost or force multiplier (rather than exclusively). So we'd analyze everything as usual but then the html, java, and xml scores would get say a 50% boost (to boost them above any noise). Though I suppose you could still use either strategy after-the-fact, just with a small amount of extra CPU burn. — Josh Goebel, Commented Oct 29, 2020 at 19:32
@CertainPerformance If you have any thoughts I'd love to hear them. github.com/highlightjs/highlight.js/issues/2768 This is what I was thinking of for an auto-detect/classification plugin API. There is no way to do exactly what you're doing here (saving a few CPU cycles), but you could accomplish the exact same result with analysis in a after callback... Logic: Did any of the auto-hinted languages score > 3? If so, return just those results, else: return the all results. Your idea would shrink to several lines of very simple JS. — Josh Goebel, Commented Oct 29, 2020 at 19:48
Another example: Some lines of code block are bold when the question has at least two language tags. — Sebastian Simon, Commented Jun 1, 2021 at 12:10

Josh Goebel · Accepted Answer · 2020-10-29 12:36:24Z

Update: I've written a Chrome extension to give us a place to land some of these ideas (and experiment with them) until hopefully they can be added to official SE one day. This fancier auto-detect isn't supported yet but there is an open issue for it if anyone wants to help. https://github.com/joshgoebel/se_highlightjs

Current maintainer of Highlight.js here. I wanted to weigh in on this. First off, a LOT of great ideas here. But I want to nitpick the leading comment just a tiny bit:

highlight.js's current auto-detection is arguably poor

It definitely can be argued. :-) It's definitely not perfect; and perhaps it's worse than Prettify, I don't know... perhaps you meant "poor by comparison" vs "poor in an absolute sense". [perhaps that's implied] ...but for a "best effort" feature I'd say our detection is "ok". Our language detection has always been "best effort" (based on our grammar rules) rather than "best in class". We do not consider ourselves a "language classifier". Detection is a secondary feature to our primary feature: highlighting.

To be clear: That doesn't mean we aren't in favor of improving it when doable, just that it's not our primary focus.

But yes, it can look quite terrible when we get it wrong for something that (as a human) seems impossibly simple to classify. Sometimes that's because our language detection is buggy [grammar rules are too broad]. In truly egregious situations this is likely the case (and we're open to fixing them, if possible). Sometimes it's because language classification is simply a hard problem. I do have a PR that nets a 4-5% detection improvement against one of the language-detection.el datasets. It should land in version 10.4 (Nov/Dec probably).

For a lot more context on this we have a long-lived thread:

https://github.com/highlightjs/highlight.js/issues/1213

Ok, now on to the good stuff:

The main idea here is definitely on the right track, but it can be done much more simply by just using our config setting for autodetect:

hljs.configure({languages: ["js","html","css"]})

This will scope the "global" language stack used by auto-detection (which is used by highlightBlock when the language is not specified).
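A minimal sketch of how that setting could be wired in, assuming the list of languages for the question has already been derived from its tags. The function name highlightScopedToTags is hypothetical, and the hljs instance is passed in as a parameter so the sketch does not depend on how SE loads the library; configure and highlightBlock are the calls already mentioned in this thread.

```javascript
// Sketch: scope auto-detection to the question's languages, then
// highlight each block the same way SE already does.
function highlightScopedToTags(hljs, blocks, languages) {
    if (languages.length > 0) {
        // Restricts the global list of candidates used by auto-detection.
        hljs.configure({ languages });
    }
    // With no languages, the configuration is left untouched and
    // detection falls back to every loaded language.
    for (const block of blocks) {
        hljs.highlightBlock(block);
    }
}
```

Because configure changes global state, the restricted list stays in effect for any later auto-detection on the same page, which is worth keeping in mind if posts with different tags are ever highlighted by the same instance.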

So yes, SE should consider converting post tags to a list of language grammars and then scope the auto-detect accordingly (which is already built-in to the core library)... OR it should heavily weight the chosen tags (ie, if a post is tagged js/angular then JS and Angular would get an 80% "likely boost")... there is no built-in way to do this, but it shouldn't be more than 20-30 lines of code (essentially writing a custom highlightAuto with different scoring ideas)...

I'm also open to making this type of post-scoring "classification" easier to do via a plugin if anyone on the SE core team would like to discuss that. IE, after highlightAuto runs it would pass the raw results to a "classifier" plugin that is free to make its own decisions, based on its own criteria.
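As a rough illustration of the weighting idea, the sketch below picks a language from a list of per-language scores after boosting the ones hinted by tags. Everything here is hypothetical: no such plugin API exists yet, the name pickWithBoost and the input shape ({language, relevance} objects) are invented for the example, and reading the "80% boost" as a multiplier of 1.8 is an assumption.

```javascript
// Sketch of a post-scoring "classifier": every language is still scored
// as usual, but languages hinted by the question's tags get a boost.
function pickWithBoost(scores, hintedLanguages, boost = 1.8) {
    const hinted = new Set(hintedLanguages);
    let best = null;
    for (const { language, relevance } of scores) {
        const adjusted = hinted.has(language) ? relevance * boost : relevance;
        if (best === null || adjusted > best.adjusted) {
            best = { language, relevance, adjusted };
        }
    }
    // null when there were no scores to choose from.
    return best;
}
```

Unlike restricting detection to the hinted languages, this still lets an unhinted language win when its raw score is far ahead, which may help on mistagged questions.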

// Otherwise, result relevance is unexpectedly low; perhaps question is mistagged,

The past week or two I've had this exact thought... I'm not sure 3 is the right number, I had 5 in my mind... if someone wanted to contribute this to the core library I think that would be a great addition (#hacktoberfest). We could even discuss making this a configurable threshold... so essentially auto-detect would simply consider anything less than X as scoring 0 - rendering it as plaintext - rather than making a potentially wild guess.
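The threshold idea could look something like the sketch below. It is not part of highlight.js: the function name languageOrPlaintext is hypothetical, the default of 5 is just the number floated above, and the input is assumed to be a result object with language and relevance fields like the one highlightAuto returns.

```javascript
// Sketch: treat any detection scoring below the threshold as "no match"
// and fall back to plaintext instead of a potentially wild guess.
function languageOrPlaintext(result, minRelevance = 5) {
    if (result.relevance < minRelevance) {
        return 'plaintext';
    }
    return result.language;
}
```

Making minRelevance a parameter mirrors the suggestion that the cutoff be configurable, since the right value (3, 5, or something else) is still an open question.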

Opened an issue to track this: github.com/highlightjs/highlight.js/issues/2768 — Josh Goebel, Commented Oct 17, 2020 at 19:54
It definitely can be argued Oh, sure. The main issue is in SE's implementation of it. But when given reasonable language hints, the autodetector works pretty well. Not perfectly, but when given no hints, it's a really hard problem to correctly automatically detect one of a whole bunch of languages, especially when actual parsers for the languages are too big / complicated to include. — CertainPerformance, Commented Oct 17, 2020 at 20:23
The main issue is in SE's implementation of it. In context here I wholly agree. SE has HUGE contextual info (via tags, etc) on what language these snippets likely are so it needs to share that with HLJS in order to [together] make the best classification decisions. I wager that would immediately nix 80% of the "hey my code is highlighted the wrong language" issues (and too often these are mislabeled as "highlighting just sucks"). Seems like an easy win for SE... — Josh Goebel, Commented Oct 17, 2020 at 20:27

Des · Accepted Answer · 2021-02-19 21:37:59Z

Based on our current roadmap, we don't have immediate plans to work on this request. However, we'll review it in the context of the upcoming editor alpha feedback analysis and provide more details here if/when it has been prioritized and/or completed.


So debugging feature upgrades (like switching libraries used) that degrade front end presentation from its predecessor get a 6-8 month, not even 6-8 week, priority? Numerous issues were reported when the switch to highlight.js was made 6 months ago and as far as I can see this is the first response to any of them by anyone involved with site operations. Stack Snippets bugs and requests for improvements lingering for years get totally ignored. Seems that the roadmaps disregard priorities of users somewhat akin to how politicians often ignore the kitchen table issues of their constituents.
– charlietfl
Commented Feb 20, 2021 at 14:20

I set it as deferred because the work won't happen this quarter. However, we do have plans to review all the feedback that we received on the editor this quarter (from meta and ongoing usability testing) and map out implementation plans for Q2. We reserved using status-planned for active development and status-deferred for things we know we want to prioritize in the near term. Hope that helps to clarify.
– Des StaffMod
Commented Feb 20, 2021 at 17:21
