highlight.js's current auto-detection is arguably poor, in large part because it has to pick from a large number of available languages when it is not given a language hint.
I suggest that, when the language to use is ambiguous, Stack Exchange pass highlight.js the languages associated with the question's tags, instead of having it choose from every single language loaded on Stack Exchange. For example, a question tagged with javascript and css should call highlight.js with ['javascript', 'css'] as language hints, rather than with no hint; when no hint is given, the resulting highlighting is frequently wrong. This can easily be done by calling highlight.js's current API slightly differently.
Examples of current problems (see end of post for many more):
- https://codereview.stackexchange.com/revisions/249795/1 On a question tagged with javascript algorithm object-oriented dictionary, code blocks in answers are auto-formatted as csharp, markdown, ini, and kotlin, which are completely unrelated.
- https://meta.stackoverflow.com/q/401573 On a question tagged with html jquery, a code block is auto-formatted as lua, which is completely unrelated.
- https://meta.stackexchange.com/a/354695
Currently, we're trying to usually have at most one tag on a question which is linked to a specific highlight language. Otherwise, if two or more of the question's tags are associated with a highlight language, the choice is ambiguous, and highlight.js is called to highlight a code block with no language hint at all, which frequently gives inaccurate results.
The logic currently being used is:
- Some tags are associated with highlight languages. These associations can be seen at the bottom of the tag wiki page, e.g. on SO, javascript is associated with lang-js: "Code Language (used for syntax highlighting): lang-js"
- If a question has exactly one tag with an associated highlight language, all code blocks in the post get highlighted with that language.
- If there are 2+ tags associated with a language, all code blocks in the post are highlighted by having highlight.js guess the most appropriate language among all available languages (not just the languages associated with the question's tags, but every language SE has loaded), which doesn't work well.
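The rules above can be sketched as a small function. This is only an illustration of the described behavior, not Stack Exchange's actual code: the name currentCodeblockClass and the TAG_TO_CLASS table are hypothetical, and treating a question with no associated tags as default is an assumption.

```javascript
// Hypothetical tag-to-class associations; the real ones are set on each tag's wiki page.
const TAG_TO_CLASS = new Map([
  ['javascript', 'lang-js'],
  ['xml', 'lang-xml'],
  ['css', 'lang-css'],
]);

// Returns the class that would be applied to every code block under the current rules.
function currentCodeblockClass(questionTags, tagToClass = TAG_TO_CLASS) {
  // Keep only the tags that have an associated highlight language.
  const classes = questionTags
    .map((tag) => tagToClass.get(tag))
    .filter(Boolean);
  // Exactly one associated tag: use its language.
  // Anything else: 'default', which leaves highlight.js to guess among all loaded languages.
  return classes.length === 1 ? classes[0] : 'default';
}

console.log(currentCodeblockClass(['javascript', 'algorithm'])); // lang-js
console.log(currentCodeblockClass(['javascript', 'xml']));       // default
```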
My suggestion: Highlighting would be much more flexible and accurate if, in the case of 2+ associated tags, highlight.js were called with those tags' languages as hints, rather than with no hint at all. This not only improves the appearance of questions with multiple tags, it also allows default languages to be associated with more tags. (We're currently trying to avoid having two or more tags associated with a highlight language on a question, which causes problems.) On SO, it's not that uncommon to see a question tagged with a subtag but not with the language's main tag, e.g. angular but not javascript, resulting in bad highlighting.
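A minimal sketch of how the hints could be derived from a question's tags under this suggestion. The function name languageHintsForTags and the TAG_TO_HLJS_LANG table are hypothetical; in particular, mapping the jquery subtag to javascript is only an example of associating a default language with more tags.

```javascript
// Hypothetical associations from tags to highlight.js language names.
const TAG_TO_HLJS_LANG = new Map([
  ['javascript', 'javascript'],
  ['jquery', 'javascript'], // subtag mapped to its main language
  ['html', 'xml'],
  ['css', 'css'],
]);

// Returns the de-duplicated list of languages to offer highlight.js as hints.
function languageHintsForTags(questionTags, tagToLang = TAG_TO_HLJS_LANG) {
  const hints = questionTags
    .map((tag) => tagToLang.get(tag))
    .filter(Boolean); // drop tags with no associated language
  return [...new Set(hints)];
}

console.log(languageHintsForTags(['html', 'jquery']));              // [ 'xml', 'javascript' ]
console.log(languageHintsForTags(['javascript', 'jquery', 'css'])); // [ 'javascript', 'css' ]
```

An empty result means no tag gave a usable hint, in which case the caller would fall back to unrestricted auto-detection as today.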
Technical front-end details on how this can be accomplished with a couple simple tweaks:
When an SE page is generated, a #js-codeblock-lang element is populated with the highlight language to use, if there is exactly one tag on the question associated with a language. E.g. a question with javascript gets lang-js. A question with javascript xml gets default because both javascript and xml are associated with a language.
Through SE's JS, the content of this element is added to the class of every code block in the post. For example, <pre class="lang-xml s-code-block"> or <pre class="default s-code-block">.
When it comes time to style code blocks, SE runs:
StackExchange.using("highlightjs", function () {
    $("pre.s-code-block:not(.hljs)").each(function () {
        StackExchange.highlightjs.instance.highlightBlock(this);
    });
});
where highlightBlock (docs here) is the highlight.js function which highlights a code block. If the block has a language in the class attribute, that language will be used. If the class is default, highlight.js will guess the most appropriate language from all of the dozens of languages that are loaded. (This is the problem.)
Edit: Below was my original suggestion of a way automatic syntax highlighting could be accomplished better, but one of the highlighter's maintainers, Josh, has a better suggestion below.
We can force highlight.js to choose the most appropriate language from a small set of languages by using highlightAuto instead of highlightBlock. Unlike highlightBlock, highlightAuto can accept a parameter of languages to choose from. For example, passing ['xml', 'js'] will ensure that the resulting code is highlighted as either xml or js (and not something completely unrelated like lua). highlightAuto also returns an object containing the new HTML markup, instead of modifying a passed DOM node.
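As a rough illustration of that call shape, here is a small wrapper. The name detectWithHints is hypothetical, and the highlighter is passed in as a parameter (on SE it would presumably be StackExchange.highlightjs.instance); the only highlight.js behavior relied on is highlightAuto(code, languageSubset) returning an object with language, relevance and value.

```javascript
// `highlighter` is any object exposing highlight.js's highlightAuto(code, languageSubset).
function detectWithHints(highlighter, codeText, hints) {
  // With no hints, omit the subset entirely so that detection is not
  // restricted to an empty set of languages.
  const result = hints.length
    ? highlighter.highlightAuto(codeText, hints)
    : highlighter.highlightAuto(codeText);
  return {
    language: result.language,   // the detected language
    relevance: result.relevance, // detection score; higher means a more confident match
    html: result.value,          // highlighted markup, to be assigned to the <code> element
  };
}
```

Because the result is returned rather than applied to a DOM node, the caller decides whether to use it, for example only when the relevance is above some threshold.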
As a proof of concept, for a test run of my suggestion, I replaced Stack Exchange's code block above with the following code (hidden in the snippet) and looked at a bunch of questions (which were originally highlighted incorrectly) to see how well auto-detection would perform given a small number of languages to choose between:
// I'm using a Stack Snippet here to hide a long code block by default
throw new Error('This is not runnable here');
// The following code is just an example of how one might use highlightAuto:
StackExchange.using("highlightjs", function () {
    // This example uses the below object instead of the server-sent language
    const langsByTag = {
        javascript: 'js',
        java: 'java',
        python: 'python',
        'c#': 'csharp',
        php: 'php',
        html: 'xml',
        jquery: 'js',
        // CSS auto-highlighting is broken for some reason
        // (a completely separate issue), but SCSS works well
        css: 'scss',
        typescript: 'ts',
    };
    const thisQuestionTags = [...$('.question .post-tag')].map(a => a.textContent);
    const langs = [...new Set(thisQuestionTags.map(tag => langsByTag[tag]))].filter(Boolean);
    $("pre.s-code-block:not(.hljs)").each(function () {
        const code = this.children[0];
        const codeText = code.textContent;
        const doHighlight = (result) => {
            code.innerHTML = result.value;
            // Clearly expose the detected highlighted language by putting it into the DOM:
            this.dataset.highlightLang = result.language;
        };
        const doHighlightWithoutLanguageHints = () => {
            doHighlight(StackExchange.highlightjs.instance.highlightAuto(codeText));
        };
        if (!langs.length) {
            doHighlightWithoutLanguageHints();
            return;
        }
        // Auto-detect language, but only permit a language from one of the tags on the question:
        const highlightResult = StackExchange.highlightjs.instance.highlightAuto(codeText, langs);
        if (highlightResult.relevance >= 3) {
            // Result relevance isn't horrible, use it:
            doHighlight(highlightResult);
        } else {
            // Otherwise, result relevance is unexpectedly low; perhaps question is mistagged,
            // or the language or the code block does not have enough language-specific syntax
            // Auto-detect language from all loaded languages.
            // Might well be inaccurate, but it may be better than the prior result:
            doHighlightWithoutLanguageHints();
        }
    });
});
Here's a small sample of questions which used to be highlighted badly, but are now highlighted correctly, using the above code:
- https://stackoverflow.com/q/63030994 An excellent example. In a question with java html css, all the code blocks used to be highlighted as Java. Now, the two HTML blocks are properly highlighted as XML, the CSS block is highlighted as CSS, and the two config blocks are highlighted as INI.
- https://stackoverflow.com/q/64129300 TypeScript can now be highlighted properly (screenshot: before/after)
- https://stackoverflow.com/q/55064068 In a question with javascript css, the CSS code block is now highlighted properly as CSS, not JS
- https://stackoverflow.com/q/64093029 In a question with html jquery, the HTML code block is now highlighted properly as XML, not Lua
- https://stackoverflow.com/q/61985511 In a question with php html, the HTML code block is now highlighted properly as XML, not PHP
- https://stackoverflow.com/q/56120519 In a question tagged with javascript c#, the first code block is now highlighted properly as JavaScript, not Less
- https://stackoverflow.com/q/61287492 In a question tagged with python html, the two HTML code blocks are now highlighted properly as XML, not Python
- https://stackoverflow.com/q/53122772 In a question tagged with javascript java, the following code blocks are corrected properly: Kotlin -> Java, XML -> JavaScript, C# -> Java
And so on. These are easy to find. It still isn't perfect, but I think this would be a solid improvement over the current logic, and it only requires a small change in Stack Exchange's code: pass all languages associated with a question's tags to #js-codeblock-lang, then call highlightAuto instead of highlightBlock.