1148

What characters must be escaped in XML documents, or where could I find such a list?

6
  • 12
    Example: <company>AT&amp;T</company>
    – jacktrades
    Commented Dec 5, 2012 at 19:47
  • 2
    See Simplified XML Escaping below for a concise and easily remembered guide that I've distilled from primary sources (W3C Extensible Markup Language (XML) 1.0 (Fifth Edition)).
    – kjhughes
    Commented Feb 14, 2018 at 16:40
  • 2
    Literally none of the answers here are correct. You also must escape many various control characters in XML 1.1.
    – Jason C
    Commented May 4, 2021 at 18:27
  • 1
    @JasonC: Understanding the question as intended rather than literally is ideal. If you feel future readers would benefit from an elaboration of how to specify control characters in XML, please elaborate in an answer. Thanks.
    – kjhughes
    Commented Dec 3, 2021 at 16:52
  • @kjhughes With the question being interpreted as intended, literally none of the answers here are correct. You also must escape many various control characters in XML 1.1, as outlined here. See also XML 1.1 §4.1, §4.4, §4.6, and Appx. C for specific details and restrictions.
    – Jason C
    Commented Dec 3, 2021 at 20:26

10 Answers

1663

If you use an appropriate class or library, it will do the escaping for you. Many XML issues are caused by building documents through string concatenation.
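As a minimal sketch of letting a library do the escaping, assuming Python's standard `xml.etree.ElementTree` (the `motto` attribute and the sample strings are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Build the element through the API instead of concatenating strings.
company = ET.Element("company", attrib={"motto": 'Say "hi" & <wave>'})
company.text = "AT&T <Research>"

# The serializer escapes markup-significant characters on output.
xml_text = ET.tostring(company, encoding="unicode")
print(xml_text)

# Parsing the serialized form gives back the original, unescaped strings.
parsed = ET.fromstring(xml_text)
print(parsed.text, parsed.get("motto"))
```

The exact set of characters the serializer chooses to escape is up to the library; the point is only that the round trip preserves the data.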

XML escape characters

There are only five:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;
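If you do need a string-level helper, one possible sketch uses `xml.sax.saxutils.escape` from the Python standard library. It handles `&`, `<` and `>` by default, so the two quote characters are passed as extra entities; the helper name `escape_all_five` is just an illustration:

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > itself; the quotes are supplied as extras.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}

def escape_all_five(text: str) -> str:
    """Escape all five predefined XML entities (the 'safe way')."""
    return escape(text, QUOTES)

original = """<a href="x">Tom & Jerry's</a>"""
escaped = escape_all_five(original)
restored = unescape(escaped, UNQUOTES)
print(escaped)
```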

Escaping characters depends on where the special character is used.

The examples can be validated at the W3C Markup Validation Service.

Text

The safe way is to escape all five characters in text. However, the three characters ", ' and > needn't be escaped in text:

<?xml version="1.0"?>
<valid>"'></valid>
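If you'd rather check this locally than with an online validator, a rough sketch with Python's `xml.etree.ElementTree` could look like the following. It only tests well-formedness, and the helper `is_well_formed` and the sample documents are illustrative:

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Literal ", ' and > are accepted in text content.
literal_quotes_and_gt = '<?xml version="1.0"?>\n<valid>"\'></valid>'
# Literal < and & are not.
literal_lt = '<?xml version="1.0"?>\n<invalid>a < b</invalid>'
literal_amp = '<?xml version="1.0"?>\n<invalid>AT&T</invalid>'

for doc in (literal_quotes_and_gt, literal_lt, literal_amp):
    print(is_well_formed(doc))
```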

Attributes

The safe way is to escape all five characters in attributes. However, the > character needn't be escaped in attributes:

<?xml version="1.0"?>
<valid attribute=">"/>

The ' character needn't be escaped in attributes if the quotes are ":

<?xml version="1.0"?>
<valid attribute="'"/>

Likewise, the " needn't be escaped in attributes if the quotes are ':

<?xml version="1.0"?>
<valid attribute='"'/>
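The quote-dependent rules above are what `xml.sax.saxutils.quoteattr` in the Python standard library automates: it picks the delimiter based on which quotes appear in the value and falls back to `&quot;` only when both kinds are present. A small sketch (the sample values are made up):

```python
from xml.sax.saxutils import quoteattr

# Only an apostrophe inside: double quotes are used as delimiters.
single_inside = quoteattr("it's")
# Only double quotes inside: single quotes are used as delimiters.
double_inside = quoteattr('say "hi"')
# Both kinds inside: double-quoted, with " escaped as &quot;.
both_inside = quoteattr("""both ' and \"""")

for value in (single_inside, double_inside, both_inside):
    print(f"<valid attribute={value}/>")
```

Note that `quoteattr` returns the value including its surrounding quotes, so it is dropped into the tag without adding any.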

Comments

None of the five special characters should be escaped in comments, because entity references are not expanded there:

<?xml version="1.0"?>
<valid>
<!-- "'<>& -->
</valid>

CDATA

None of the five special characters should be escaped in CDATA sections, because entity references are not expanded there:

<?xml version="1.0"?>
<valid>
<![CDATA["'<>&]]>
</valid>
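The one sequence a CDATA section cannot contain is its own terminator, `]]>`. A common workaround is to split the text across two adjacent CDATA sections. Here is a sketch of that idea; the helper `to_cdata` is hypothetical, and the round trip is checked with `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" across two sections."""
    # "]]>" becomes "]]" + end of section + start of new section + ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

payload = 'if (a[b[0]]>1 && c<"d") {}'
document = "<valid>" + to_cdata(payload) + "</valid>"
root = ET.fromstring(document)
print(document)
print(root.text == payload)
```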

Processing instructions

None of the five special characters should be escaped in XML processing instructions, because entity references are not expanded there:

<?xml version="1.0"?>
<?process <"'&> ?>
<valid/>

XML vs. HTML

HTML has its own set of escape codes which cover a lot more characters.

12
  • 43
    @Pacerier, I beg you not to write your own XML/HTML escaping code. Use a library function or you're bound to miss a special case.
    – Jason
    Commented Mar 16, 2012 at 9:23
  • 8
    Also for line breaks you need to use &#xA; &#xD; and &#x9; for tab, if you need these characters in an attribute.
    – radistao
    Commented Nov 26, 2012 at 22:33
  • 91
    If you're going to do a Find/Replace on these, just remember to do the &amp; replacement before the others.
    – Doug
    Commented Jun 15, 2013 at 21:29
  • 3
    @Doug I was just about to mention the exact same thing - or else all other replaced characters will be corrupted, and things like &quot; will be changed to &amp;quot;
    Commented Aug 5, 2013 at 22:23
  • 11
    From Wikipedia: "All permitted Unicode characters may be represented with a numeric character reference." So there are a lot more than 5.
    – Tim Cooper
    Commented Aug 15, 2014 at 7:47
108

New, simplified answer to an old, commonly asked question...

Simplified XML Escaping (prioritized, 100% complete)

  1. Always (90% important to remember)

    • Escape < as &lt; unless < is starting a <tag/> or other markup.
    • Escape & as &amp; unless & is starting an &entity;.
  2. Attribute Values (9% important to remember)

    • attr=" 'Single quotes' are ok within double quotes."
    • attr=' "Double quotes" are ok within single quotes.'
    • Escape " as &quot; and ' as &apos; otherwise.
  3. Comments, CDATA, and Processing Instructions (0.9% important to remember)

    • <!-- Within comments --> nothing has to be escaped but no -- strings are allowed.
    • <![CDATA[ Within CDATA ]]> nothing has to be escaped, but no ]]> strings are allowed.
    • <?PITarget Within PIs ?> nothing has to be escaped, but no ?> strings are allowed.
  4. Esoterica (0.1% important to remember)

11
  • 3
    One other rule worth noting: ]]> must be escaped as ]]&gt;, even when not in a CDATA section. The easiest way of achieving that may be to always escape > as &gt;.
    Commented May 29, 2018 at 15:24
  • Thanks, @MichaelKay. I've incorporated your helpful note about ]]> but chose to relegate it to esoterica rather than suggesting that > always be escaped (which it needn't be, as you know). My goal here is to make the XML escaping rules easily remembered and 100% accurate.
    – kjhughes
    Commented Jun 3, 2018 at 14:01
  • The above answers including accepted one mention all five characters should be escaped inside attributes. Do you have any reference to XML standard to back what you are saying as your answer logically seems to be the correct one?
    – Roman Susi
    Commented Feb 7, 2020 at 5:49
  • 3
    @RomanSusi: Yes, many other answers contain errors or overgeneralizations ("The safe way...") based on hearsay, misinterpretation, or misunderstanding of the official XML BNF. My answer is (a) 100% justified by W3C XML Recommendation; see the many linked references to the official BNF, and (b) organized in a concise, logical, and easily remembered progression of those requirements.
    – kjhughes
    Commented Feb 7, 2020 at 13:44
  • 1
    I think I should change my future first child name from Felipe to ";'Felipe]]><PLAINTEXT> <!-- and see what happens to most websites Commented Nov 18, 2020 at 8:29
99

Perhaps this will help:

List of XML and HTML character entity references:

In SGML, HTML and XML documents, the logical constructs known as character data and attribute values consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series of characters called a character reference, of which there are two types: a numeric character reference and a character entity reference. This article lists the character entity references that are valid in HTML and XML documents.

That article lists the following five predefined XML entities:

quot  "
amp   &
apos  '
lt    <
gt    >
85

According to the specifications of the World Wide Web Consortium (W3C), there are 5 characters that must not appear in their literal form in an XML document, except when used as markup delimiters or within a comment, a processing instruction, or a CDATA section. In all the other cases, these characters must be replaced either using the corresponding entity or the numeric reference according to the following table:

Original character   XML entity replacement   XML numeric replacement
<                    &lt;                     &#60;
>                    &gt;                     &#62;
"                    &quot;                   &#34;
&                    &amp;                    &#38;
'                    &apos;                   &#39;

Notice that the aforementioned entities can also be used in HTML, with the exception of &apos;, which was introduced with XHTML 1.0 and is not declared in HTML 4. For this reason, and to ensure backward compatibility, the XHTML specification recommends the use of &#39; instead.
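To illustrate that the named and numeric forms in the table are interchangeable, here is a small sketch using Python's `xml.etree.ElementTree`. The helper `numeric_reference` is made up for this example; it derives the decimal reference from the character's code point:

```python
import xml.etree.ElementTree as ET

NAMED = {"<": "&lt;", ">": "&gt;", '"': "&quot;", "&": "&amp;", "'": "&apos;"}

def numeric_reference(char: str) -> str:
    """Build the decimal character reference for a single character."""
    return f"&#{ord(char)};"

# Parse each character once via its named entity and once via its numeric form.
decoded = {
    char: (
        ET.fromstring(f"<t>{named}</t>").text,
        ET.fromstring(f"<t>{numeric_reference(char)}</t>").text,
    )
    for char, named in NAMED.items()
}

for char, (from_named, from_numeric) in decoded.items():
    print(char, from_named == char, from_numeric == char)
```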

4
  • 18
    XML predefines those five entities, but it absolutely does NOT specify that you can't use any of those five characters in their literal form. < and & have to be escaped everywhere (except CDATA). " and ' only have to be escaped in attribute values, and only if the corresponding quote character is the same. And > never actually has to be escaped.
    Commented Aug 24, 2013 at 13:58
  • 3
    As written above, < > " & ' do not have to be escaped when used as markup delimiters or within a comment, a processing instruction, or a CDATA section. i.e. when you use < > as an XML tag you don't escape it. Same thing for a comment (would you escape an & in a commented line of a XML file? You don't need to, and your XML is still valid if you don't). This is clearly specified in the official recommendations for XML by W3C.
    – Albz
    Commented Oct 1, 2013 at 7:21
  • 7
    @ShaunMcCance > must be escaped if it follows ]] within content, unless it's intended to be part of the ]]> delimiter that indicates the end of a CDATA section.
    – Lee D
    Commented Apr 25, 2014 at 17:45
  • 3
    Not to be a necromancer, but @Albz is incorrect in saying that these characters MUST be entitized in content. See section 2.4 at w3.org/TR/REC-xml/#NT-CharData. The TL;DR version of that is that in chardata element content, &amp; and &lt; have to always be entitized. The &gt; character MAY be entitized, although it MUST be when appearing in the literal string “]]>” because otherwise that will be read as ending a CDATA section. For single-quote and double-quote, you can escape if you want to. That's it, for chardata inside elements. Other components of XML have other rules.
    – chris
    Commented May 3, 2016 at 17:52
54

Escaping characters is different for tags and attributes.

For tags:

 < &lt;
 > &gt; (only for compatibility, read below)
 & &amp;

For attributes:

" &quot;
' &apos;

From Character Data and Markup:

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as " &apos; ", and the double-quote character (") as " &quot; ".
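Read literally, the quoted passage boils down to a fairly small set of replacements. The following is only a sketch of that reading, not a substitute for a real serializer: the function names are hypothetical, the attribute variant assumes the value is delimited by double quotes, and neither function deals with control characters or with whitespace normalization in attributes.

```python
import xml.etree.ElementTree as ET

def escape_text(text: str) -> str:
    """Minimal escaping for element content: &, <, and the > of "]]>"."""
    return (
        text.replace("&", "&amp;")      # first, so later entities are not re-escaped
            .replace("<", "&lt;")
            .replace("]]>", "]]&gt;")
    )

def escape_double_quoted_attr(value: str) -> str:
    """Minimal escaping for an attribute value delimited by double quotes."""
    return value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")

content = "x[y[0]]> 1 && a < b"
attr = 'Tom & "Jerry" <cat>'
doc = f'<t a="{escape_double_quoted_attr(attr)}">{escape_text(content)}</t>'
root = ET.fromstring(doc)
print(doc)
```

Replacing `&` before anything else matters: done last, it would turn an already produced `&lt;` into `&amp;lt;`.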

1
  • 1
    This implies that for attributes only quotes need to be escaped, but that is in addition to the other three characters
    – eug
    Commented Jul 5, 2018 at 4:46
29

In addition to the commonly known five characters [<, >, &, ", and '], I would also escape the vertical tab character (0x0B). It is valid UTF-8, but not valid XML 1.0, and even many libraries (including the highly portable (ANSI C) library libxml2) miss it and silently output invalid XML.
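One way to guard against this before serializing is to filter the text against the `Char` production of XML 1.0. Note that XML 1.0 does not allow character references to these characters either, so this sketch removes (or substitutes) them rather than escaping them. The function name and the choice to drop by default are my assumptions:

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml10(text: str, replacement: str = "") -> str:
    """Remove (or replace) characters that XML 1.0 does not permit."""
    return _INVALID_XML10.sub(replacement, text)

# Vertical tab (0x0B) and form feed (0x0C) are dropped; the tab is kept.
cleaned = strip_invalid_xml10("a\x0bb\x0cc\td")
print(repr(cleaned))
```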

0
14

Abridged from: XML, Escaping

There are five predefined entities:

&lt; represents "<"
&gt; represents ">"
&amp; represents "&"
&apos; represents '
&quot; represents "

"All permitted Unicode characters may be represented with a numeric character reference." For example:

&#20013;
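As a quick illustration in Python: the decimal reference above and its hexadecimal form decode to the same character, and the `xmlcharrefreplace` error handler of `str.encode` produces such references for characters the target encoding cannot represent. That handler does not touch `&` or `<`, so it is not a replacement for escaping.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to the same code point (U+4E2D).
root = ET.fromstring("<t>&#20013;&#x4E2D;</t>")
same_char_twice = root.text

# Going the other way: emit ASCII-only output with numeric references.
ascii_only = "中".encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(same_char_twice, ascii_only)
```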

Most of the control characters and other Unicode ranges are specifically excluded, meaning (I think) they can't occur either escaped or direct:

Valid characters in XML

7

The accepted answer is not correct. It is best to use a library for escaping XML.

As mentioned in this other question

"Basically, the control characters and characters out of the Unicode ranges are not allowed. This means also that calling for example the character entity is forbidden."

If you only escape the five characters, you can still run into errors such as "An invalid XML character (Unicode: 0xc) was found".
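The same kind of failure can be reproduced with Python's `xml.etree.ElementTree`, which rejects a form feed (0xC) whether it appears literally or as a character reference. This is only a sketch; the helper `parse_error` is made up for the example, and the exact error message depends on the parser:

```python
import xml.etree.ElementTree as ET

def parse_error(document: str):
    """Return the parser's error message, or None if the document parses."""
    try:
        ET.fromstring(document)
    except ET.ParseError as exc:
        return str(exc)
    return None

literal_form_feed = parse_error("<t>\x0c</t>")      # raw control character
referenced_form_feed = parse_error("<t>&#12;</t>")  # same character as a reference
tab_is_fine = parse_error("<t>\t</t>")              # tab is a permitted character

print(literal_form_feed)
print(referenced_form_feed)
print(tab_is_fine)
```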

2
4

It depends on the context. For element content, it is < and &, plus ]]> (a three-character string rather than a single character).

For attribute values, it is <, &, ", and '.

For CDATA, it is ]]>.

-9

Only < and & are required to be escaped if they are to be treated as character data and not markup:

2.4 Character Data and Markup

0
