125

I'm writing a Chrome extension that involves doing a lot of the following job: sanitizing strings that might contain HTML tags, by converting <, > and & to &lt;, &gt; and &amp;, respectively.

(In other words, the same as PHP's htmlspecialchars(str, ENT_NOQUOTES) – I don't think there's any real need to convert double-quote characters.)

This is the fastest function I have found so far:

function safe_tags(str) {
    return str.replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;') ;
}

But there's still a big lag when I have to run a few thousand strings through it in one go.

Can anyone improve on this? It's mostly for strings between 10 and 150 characters, if that makes a difference.

(One idea I had was not to bother encoding the greater-than sign – would there be any real danger with that?)
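To put numbers on the lag before changing anything, a rough timing harness can help. This is only a sketch: `makeSamples` and `timeIt` are hypothetical helpers invented for illustration, the sample text is made up, and the single-pass variant is just one alternative to compare against. Results will vary by browser and input.

```javascript
// Chained replace, as in the question.
function safeTagsChained(str) {
    return str.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Single pass with a lookup table (one alternative to compare against).
var ENTITY_MAP = { '&': '&amp;', '<': '&lt;', '>': '&gt;' };
function safeTagsSinglePass(str) {
    return str.replace(/[&<>]/g, function (ch) { return ENTITY_MAP[ch]; });
}

// Hypothetical helper: builds `count` sample strings of 10 to 150 characters.
function makeSamples(count) {
    var samples = [];
    for (var i = 0; i < count; i++) {
        var length = 10 + (i % 141);
        samples.push('<b>x & y</b> '.repeat(12).slice(0, length));
    }
    return samples;
}

// Hypothetical helper: elapsed milliseconds for one pass over the samples.
function timeIt(fn, samples) {
    var start = performance.now();
    for (var i = 0; i < samples.length; i++) {
        fn(samples[i]);
    }
    return performance.now() - start;
}

var samples = makeSamples(5000);
console.log('chained:', timeIt(safeTagsChained, samples), 'ms');
console.log('single pass:', timeIt(safeTagsSinglePass, samples), 'ms');
```

Running each variant several times and discarding the first run usually gives steadier numbers, since the first pass includes JIT warm-up.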

5
  • 2
    Why? In most cases that you want to do this, you want to insert the data into the DOM, in which case you should forget about escaping it and just make a textNode from it.
    – Quentin
    Commented Mar 31, 2011 at 11:30
  • 1
    @David Dorward: perhaps he wanted to sanitize POST data, and the server does not round-trip the data correctly.
    – Lie Ryan
    Commented Mar 31, 2011 at 11:35
  • 4
    @Lie — if so, then the solution is "For Pete's sake, fix the server as you have a big XSS hole"
    – Quentin
    Commented Mar 31, 2011 at 13:12
  • 2
    @David Dorward: it is possible that he does not have control over the server. I was in that situation recently, writing a Greasemonkey script to work around a couple of things I don't like about my university's website. I had to POST to a server I do not control and sanitize the POST data in JavaScript (the raw data comes from a rich text box, so it has heaps of HTML tags that do not round-trip through the server). The web admin was ignoring my requests to fix the website, so I had no other choice.
    – Lie Ryan
    Commented Mar 31, 2011 at 13:40
  • 1
    I have a use-case where I need to display an error message in a div. The error message can contain HTML and newlines. I want to escape the HTML and replace the newlines with <br>. Then put the result into a div for display.
    – mozey
    Commented Jul 29, 2013 at 9:09
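For the error-message use case in the last comment, one possible approach is to escape first and only then turn newlines into `<br>`, so that the inserted tags are not themselves escaped. A minimal sketch, where `toDisplayHtml` is a hypothetical name and the escaping is the chained replace from the question:

```javascript
// Escape the three characters from the question.
function escapeHtml(str) {
    return str.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Hypothetical helper: escape first, then convert line breaks to <br>.
// Doing it in the other order would escape the <br> tags as well.
function toDisplayHtml(message) {
    return escapeHtml(message).replace(/\r\n|\r|\n/g, '<br>');
}

console.log(toDisplayHtml('Expected <div>\nGot & nothing'));
// → Expected &lt;div&gt;<br>Got &amp; nothing
```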

13 Answers

135

Here's one way you can do this:

var escape = document.createElement('textarea');
function escapeHTML(html) {
    escape.textContent = html;
    return escape.innerHTML;
}

function unescapeHTML(html) {
    escape.innerHTML = html;
    return escape.textContent;
}

Here's a demo.

8
  • Redesigned the demo. Here's a fullscreen version: jsfiddle.net/Daniel_Hug/qPUEX/show/light Commented May 2, 2013 at 15:25
  • 19
    Not sure how/what/why - but this is genius.
    – rob_james
    Commented Jun 18, 2014 at 12:12
  • 5
    Looks like it is leveraging the TextArea element's existing code for escaping literal text. Very nice, I think this little trick is going to find another home.
    – Ajax
    Commented Jan 4, 2016 at 8:41
  • 3
    @jazkat I'm not using that function. The escape variable I use, I define myself in the example. Commented Jul 4, 2017 at 0:08
  • 2
    but does this lose white space etc.
    – Andrew
    Commented Jan 14, 2018 at 19:41
101

You could try passing a callback function to perform the replacement:

var tagsToReplace = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;'
};

function replaceTag(tag) {
    return tagsToReplace[tag] || tag;
}

function safe_tags_replace(str) {
    return str.replace(/[&<>]/g, replaceTag);
}

Here is a performance test: http://jsperf.com/encode-html-entities to compare with calling the replace function repeatedly, and using the DOM method proposed by Dmitrij.

Your way seems to be faster...

Why do you need it, though?

12
  • 2
    There is no need to escape >.
    – user142019
    Commented Mar 10, 2013 at 13:50
  • 8
    Actually if you put the escaped value in an html element's attribute, you need to escape the > symbol. Otherwise it would break the tag for that html element. Commented Oct 7, 2013 at 15:42
  • 2
    In normal text, characters that need escaping are rare. If you care about maximum speed, it's better to call replace only when needed: if (/[<>&"]/.test(str)) { ... }
    – Vitaly
    Commented Oct 26, 2014 at 4:22
  • 7
    @callum: No. I am not interested in enumerating cases in which I think "something could go wrong" (not least because it's the unexpected/forgotten cases that'll hurt you, and when you least expect it at that). I am interested in coding to standards (so the unexpected/forgotten cases can't hurt you by definition). I can't stress how important this is. > is a special character in HTML, so escape it. Simple as that. :) Commented Jul 20, 2015 at 15:30
  • 4
    @LightnessRacesinOrbit It's relevant because the question is what is the fastest possible method. If it's possible to skip the > replacement, that would make it faster.
    – callum
    Commented Jul 20, 2015 at 17:37
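On the attribute point raised in the comments: when the escaped value goes inside a quoted attribute, the quote characters are the ones that can end the value early, so an attribute escaper normally covers them too. A sketch under that assumption; `escapeAttr` and `ATTR_ENTITIES` are hypothetical names, not taken from any answer here:

```javascript
// Characters escaped for use inside a quoted attribute value.
var ATTR_ENTITIES = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;'
};

// Hypothetical helper: single-pass escape for attribute values.
function escapeAttr(value) {
    return String(value).replace(/[&<>"']/g, function (ch) {
        return ATTR_ENTITIES[ch];
    });
}

var title = 'a "quoted" > value';
var html = '<span title="' + escapeAttr(title) + '">x</span>';
console.log(html);
// → <span title="a &quot;quoted&quot; &gt; value">x</span>
```

Escaping `>` as well costs one extra character class entry in the same pass, which is why the sketch keeps it rather than trying to reason about when it is safe to skip.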
31

Martijn's method as a prototype function:

String.prototype.escape = function() {
    var tagsToReplace = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;'
    };
    return this.replace(/[&<>]/g, function(tag) {
        return tagsToReplace[tag] || tag;
    });
};

var a = "<abc>";
var b = a.escape(); // "&lt;abc&gt;"
4
  • 13
    If you add it to String like this, it should be called escapeHtml, since it is not escaping for a String in general. That is, String.escapeHtml is correct, but String.escape raises the question, "escape for what?" Commented Mar 13, 2014 at 3:12
  • 3
    Yeah good idea. I've moved away from extending the prototype these days to avoid conflicts. Commented Mar 13, 2014 at 23:34
  • 1
    If your browser supports Symbol, you could use that instead to avoid polluting the string-key namespace. var escape = Symbol("escape"); String.prototype[escape] = function(){ ... }; "text"[escape]();
    – Ajax
    Commented Jan 4, 2016 at 8:58
  • plus one for the example.
    – Timo
    Commented Sep 30, 2020 at 18:12
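The Symbol idea from the comments could look roughly like this. It is a sketch, assuming an environment with ES2015 Symbol support; the names `escapeHtml` and `TAGS_TO_REPLACE` are chosen here for illustration. Note that `Symbol` is called without `new`.

```javascript
// A unique key, so no string-named property is added to String.prototype.
var escapeHtml = Symbol('escapeHtml');

var TAGS_TO_REPLACE = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;'
};

String.prototype[escapeHtml] = function () {
    return this.replace(/[&<>]/g, function (tag) {
        return TAGS_TO_REPLACE[tag] || tag;
    });
};

var a = '<abc>';
var b = a[escapeHtml](); // "&lt;abc&gt;"
console.log(b);
```

Only code that holds a reference to the `escapeHtml` symbol can call the method, which avoids clashes with other scripts that extend the prototype under a string name.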
22

An even quicker/shorter solution is:

escaped = new Option(html).innerHTML

This is related to some weird vestige of JavaScript whereby the Option element retains a constructor that does this sort of escaping automatically.

Credit to https://github.com/jasonmoo/t.js/blob/master/t.js

2
  • 5
    Neat one-liner but the slowest method after regex. Also, the text here can have whitespace stripped, according to the spec
    – ShortFuse
    Commented Jan 6, 2020 at 19:25
  • Note that @ShortFuse's "slowest method" link makes my system run out of RAM (with ~6GB free) and firefox seems to stop allocating just before it's out of memory so instead of killing the offending process, linux will sit there and let you do a hard power off.
    – Luc
    Commented Jul 11, 2020 at 9:09
16

The fastest method is:

function escapeHTML(html) {
    return document.createElement('div').appendChild(document.createTextNode(html)).parentNode.innerHTML;
}

This method is about twice as fast as the methods based on 'replace'; see http://jsperf.com/htmlencoderegex/35 .

Source: https://stackoverflow.com/a/17546215/698168

2
  • JSPerf shut-down in 2017, unfortunately - can you repost it to jsbench.me ?
    – Dai
    Commented Nov 3, 2022 at 2:01
  • @Dai: unfortunately I can't repost it as I'm not the benchmark author. Commented Nov 3, 2022 at 8:20
13

The AngularJS source code also has a version inside of angular-sanitize.js.

var SURROGATE_PAIR_REGEXP = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g,
    // Match everything outside of normal chars and " (quote character)
    NON_ALPHANUMERIC_REGEXP = /([^\#-~| |!])/g;
/**
 * Escapes all potentially dangerous characters, so that the
 * resulting string can be safely inserted into attribute or
 * element text.
 * @param value
 * @returns {string} escaped text
 */
function encodeEntities(value) {
  return value.
    replace(/&/g, '&amp;').
    replace(SURROGATE_PAIR_REGEXP, function(value) {
      var hi = value.charCodeAt(0);
      var low = value.charCodeAt(1);
      return '&#' + (((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000) + ';';
    }).
    replace(NON_ALPHANUMERIC_REGEXP, function(value) {
      return '&#' + value.charCodeAt(0) + ';';
    }).
    replace(/</g, '&lt;').
    replace(/>/g, '&gt;');
}
1
  • 1
    Wow, that non-alphanum regex is intense. I don't think the | in the expression is needed though.
    – Ajax
    Commented Jan 4, 2016 at 9:14
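The surrogate-pair arithmetic in `encodeEntities` can be checked in isolation. The sketch below pulls that one expression into a hypothetical helper, `surrogatePairToCodePoint`, and compares it with the built-in `codePointAt` for U+1F600:

```javascript
// Same formula as in the surrogate-pair callback above:
// combine a high and a low surrogate into one code point.
function surrogatePairToCodePoint(pair) {
    var hi = pair.charCodeAt(0);
    var low = pair.charCodeAt(1);
    return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
}

var emoji = '\uD83D\uDE00'; // U+1F600, stored as two UTF-16 code units

console.log(surrogatePairToCodePoint(emoji));             // 128512
console.log(emoji.codePointAt(0));                        // 128512
console.log('&#' + surrogatePairToCodePoint(emoji) + ';'); // &#128512;
```

Working it through by hand: (0xD83D - 0xD800) * 0x400 = 0xF400, plus (0xDE00 - 0xDC00) = 0x200, plus 0x10000 gives 0x1F600, which is 128512 in decimal.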
10

All-in-one script:

// HTML entities Encode/Decode

function htmlspecialchars(str) {
    var map = {
        "&": "&amp;",
        "<": "&lt;",
        ">": "&gt;",
        "\"": "&quot;",
        "'": "&#39;" // ' -> &apos; for XML only
    };
    return str.replace(/[&<>"']/g, function(m) { return map[m]; });
}
function htmlspecialchars_decode(str) {
    var map = {
        "&amp;": "&",
        "&lt;": "<",
        "&gt;": ">",
        "&quot;": "\"",
        "&#39;": "'"
    };
    return str.replace(/(&amp;|&lt;|&gt;|&quot;|&#39;)/g, function(m) { return map[m]; });
}
function htmlentities(str) {
    var textarea = document.createElement("textarea");
    // Assign as text, not markup, so existing entities are escaped rather than decoded
    textarea.textContent = str;
    return textarea.innerHTML;
}
function htmlentities_decode(str) {
    var textarea = document.createElement("textarea");
    textarea.innerHTML = str;
    return textarea.value;
}

http://pastebin.com/JGCVs0Ts

5
  • I didn't downvote, but all regex style replace will fail to encode unicode... So, anyone using a foreign language is going to be disappointed. The <textarea> trick mentioned above is really cool and handles everything quickly and securely.
    – Ajax
    Commented Jan 4, 2016 at 8:59
  • 1
    The regex works fine for me with a number of non-Latin Unicode characters. I wouldn't expect anything else. How do you think this wouldn't work? Are you thinking of single-byte codepages that require HTML entities? That's what the 3rd and 4th function are for, and explicitly not the 1st and second. I like the differentiation.
    – ygoe
    Commented Feb 29, 2016 at 17:30
  • @LonelyPixel I don't think he will see your comment if you don't mention him ("Only one additional user can be notified; the post owner will always be notified")
    – baptx
    Commented Feb 29, 2016 at 19:31
  • I didn't know targeted notifications exist at all. @Ajax please see my comment above.
    – ygoe
    Commented Mar 1, 2016 at 8:00
  • @LonelyPixel I see now. For some reason I didn't think there was a textarea style replacement in this answer. I was, indeed, thinking of double codepoint big unicode values, like Mandarin. I mean, it would be possible to make a regex smart enough, but when you look at the shortcuts that browser vendors can take, I would feel pretty good betting that textarea will be much faster (than a completely competent regex). Did someone post a benchmark on this answer? I swore I had seen one.
    – Ajax
    Commented Mar 2, 2016 at 2:41
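On the Unicode question discussed in these comments: a character-class replace only touches the five ASCII characters it lists, so other text passes through unchanged. The sketch below uses compact stand-ins for the first two functions of this answer (the names `encodeSpecialChars` and `decodeSpecialChars` are hypothetical) and a made-up sample string:

```javascript
var ENCODE_MAP = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
var DECODE_MAP = { '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"', '&#39;': "'" };

// Stand-in for htmlspecialchars above.
function encodeSpecialChars(str) {
    return str.replace(/[&<>"']/g, function (m) { return ENCODE_MAP[m]; });
}

// Stand-in for htmlspecialchars_decode above (single pass over the input).
function decodeSpecialChars(str) {
    return str.replace(/&(?:amp|lt|gt|quot|#39);/g, function (m) { return DECODE_MAP[m]; });
}

var sample = '<p>日本語 & "中文" 😀</p>';
var encoded = encodeSpecialChars(sample);

console.log(encoded);
// → &lt;p&gt;日本語 &amp; &quot;中文&quot; 😀&lt;/p&gt;
console.log(decodeSpecialChars(encoded) === sample); // true
console.log(decodeSpecialChars('&amp;lt;'));          // &lt;
```

The last line shows a side benefit of decoding in a single pass: `&amp;lt;` becomes `&lt;` and is not decoded a second time into `<`, which a chain of separate replace calls could do depending on their order.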
4

function encode(r) {
  return r.replace(/[\x26\x0A\x3c\x3e\x22\x27]/g, function(r) {
    return "&#" + r.charCodeAt(0) + ";";
  });
}

test.value=encode('How to encode\nonly html tags &<>\'" nice & fast!');

/*
 \x26 is & (ampersand; it has to be first),
 \x0A is newline,
 \x22 is ",
 \x27 is ',
 \x3c is <,
 \x3e is >
*/
<textarea id=test rows=11 cols=55>www.WHAK.com</textarea>

3

I'll add XMLSerializer to the pile. It provides the fastest result without using any object caching (neither the serializer nor the Text node is cached).

function serializeTextNode(text) {
  return new XMLSerializer().serializeToString(document.createTextNode(text));
}

An added bonus is that it supports attributes, which are serialized differently than text nodes:

function serializeAttributeValue(value) {
  const attr = document.createAttribute('a');
  attr.value = value;
  return new XMLSerializer().serializeToString(attr);
}

You can see what it's actually replacing by checking the spec, both for text nodes and for attribute values. The full documentation has more node types, but the concept is the same.

As for performance, it's the fastest when not cached. When you do allow caching, then calling innerHTML on an HTMLElement with a child Text node is fastest. Regex would be slowest (as proven by other comments). Of course, XMLSerializer could be faster on other browsers, but in my (limited) testing, innerHTML is fastest.


Fastest single line:

new XMLSerializer().serializeToString(document.createTextNode(text));

Fastest with caching:

const cachedElementParent = document.createElement('div');
const cachedChildTextNode = document.createTextNode('');
cachedElementParent.appendChild(cachedChildTextNode);

function serializeTextNode(text) {
  cachedChildTextNode.nodeValue = text;
  return cachedElementParent.innerHTML;
}

https://jsperf.com/htmlentityencode/1

2

Martijn's method as a single function that also handles the " mark (for use in JavaScript):

function escapeHTML(html) {
    var fn=function(tag) {
        var charsToReplace = {
            '&': '&amp;',
            '<': '&lt;',
            '>': '&gt;',
            '"': '&#34;'
        };
        return charsToReplace[tag] || tag;
    }
    return html.replace(/[&<>"]/g, fn);
}
1
2

I'm not entirely sure about speed, but if you are looking for simplicity I would suggest using the lodash/underscore escape function.

1

Based on the comment by @Vitaly, and the fact the OP indicated the text "might contain HTML tags", the following helps significantly with text that rarely needs to be escaped.

  function escapeHtml(str) {
    // Fast path: return the string untouched when nothing needs escaping.
    return !(/[<>&"']/.test(str)) ? str :
           str.replaceAll('&', '&amp;')   .replaceAll('<', '&lt;')
              .replaceAll('>', '&gt;')    .replaceAll('"', '&quot;')
              .replaceAll("'", '&#39;');
  }
-5

A bit late to the show, but what's wrong with using encodeURIComponent() and decodeURIComponent()?

3
  • 1
    Those do something completely unrelated
    – callum
    Commented Apr 4, 2018 at 16:22
  • 2
    Perhaps the biggest abuse of the word "completely" I have ever heard. For example, in relation to the main topic question, it could be used to decode a html string (obviously for some kinda storage reason), regardless of html tags, and then easily encode it back to html again when and if required.
    – suncat100
    Commented Apr 5, 2018 at 17:27
  • 1
    @callum is correct: the question asks about html entities, and you answer about uri components, which are completely different. Commented Aug 4, 2021 at 14:24
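To make the difference concrete, here is what the two kinds of encoding produce for the same input. The `escapeHtml` function is the chained replace from the question; the sample string is made up:

```javascript
// HTML entity escaping, as in the question.
function escapeHtml(str) {
    return str.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

var input = '<b>a & b</b>';

// Percent-encoding for URLs: the result is not HTML-escaped text.
console.log(encodeURIComponent(input));
// → %3Cb%3Ea%20%26%20b%3C%2Fb%3E

// Entity escaping for HTML: readable when rendered by a browser.
console.log(escapeHtml(input));
// → &lt;b&gt;a &amp; b&lt;/b&gt;
```

A browser rendering the percent-encoded string would show the `%3C...` sequences literally, whereas the entity-escaped string displays as the original text.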
