
Looking at the XML declaration:

<?xml version="1.0" encoding="UTF-16" standalone="no"?>

Am I right to state that the encoding attribute is

  • coming too late (you can't read it properly unless you already know the encoding...)
  • redundant, hence error-prone: it's all too easy to replace it with "Big5" yet save the file in UTF-8

Or is that attribute not about the content of the stream?

Am I mixing up things here?
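For concreteness, here is the second bullet as a minimal Python sketch (the filename is just for illustration); nothing at write time complains:

    text = '<?xml version="1.0" encoding="Big5"?><a>héllo</a>'
    with open('mislabeled.xml', 'wb') as f:
        f.write(text.encode('utf-8'))  # labeled Big5, saved as UTF-8: no error here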


4 Answers


As you mentioned, you'd have to know the encoding of the file to read the encoding attribute.

However, there is a heuristic that can easily get you close enough to the "real" encoding to let you read the encoding attribute. This works because the <?xml part, by definition, can only contain characters in the ASCII range (however they are encoded).

The XML standard even describes the exact process used to find out the encoding (Appendix F, "Autodetection of Character Encodings").
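To make that concrete, here's a rough Python sketch of the no-BOM part of that detection process; the function name and the returned labels are my own:

    def sniff_xml_family(data: bytes) -> str:
        # The first four bytes must be some encoding of "<?xml" (the BOM
        # case is handled separately), so they reveal the encoding family.
        prefix = data[:4]
        if prefix == b'\x00\x00\x00\x3c':
            return 'UCS-4, big-endian'
        if prefix == b'\x3c\x00\x00\x00':
            return 'UCS-4, little-endian'
        if prefix == b'\x00\x3c\x00\x3f':
            return 'UTF-16BE or a similar 16-bit encoding'
        if prefix == b'\x3c\x00\x3f\x00':
            return 'UTF-16LE or a similar 16-bit encoding'
        if prefix == b'\x3c\x3f\x78\x6d':  # "<?xm" in ASCII
            return 'ASCII-compatible; read the encoding attribute to narrow it down'
        if prefix == b'\x4c\x6f\xa7\x94':  # "<?xm" in EBCDIC
            return 'EBCDIC; read the encoding attribute for the code page'
        return 'no declaration; the spec says to assume UTF-8'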

And the encoding label isn't redundant either. For example, if you use the algorithm in the XML spec and find out that some ASCII-based (or ASCII-compatible) encoding is used, you still need to read the encoding attribute to find out which one is actually used (valid candidates would be ASCII, UTF-8, any of the ISO-8859-* encodings, any of the Windows-* encodings, KOI8-R and many, many others). For the <?xml part itself it won't make a difference which one it is, but for the rest of the document it can make a huge difference.
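In code, that second step might look like this; a rough sketch, assuming Python, with the regex and the 100-byte window being my own simplifications:

    import re

    def declared_encoding(data: bytes, default: str = 'utf-8') -> str:
        # The declaration itself may only contain ASCII characters, so the
        # part we care about can safely be decoded as ASCII.
        head = data[:100].decode('ascii', errors='replace')
        match = re.search(r'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']', head)
        return match.group(1) if match else default

    data = '<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9</a>'.encode('iso-8859-1')
    text = data.decode(declared_encoding(data))  # now the whole document is readable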

Regarding mis-labeled XML files: yes, it's easy to produce those. However, the XML spec clearly specifies that such files are malformed and as such are not correct XML. Incorrect encodings must be reported as a fatal error (as long as they can be detected!). So it's the problem of whoever is producing the XML.
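You can see a parser enforce this; a minimal demonstration with Python's (expat-based) standard library, where the exact error message will vary by parser:

    import xml.etree.ElementTree as ET

    # Saved as UTF-8 but labeled US-ASCII: the first non-ASCII byte makes
    # the document not well-formed, and the parser must report it.
    bad = '<?xml version="1.0" encoding="us-ascii"?><a>é</a>'.encode('utf-8')
    try:
        ET.fromstring(bad)
    except ET.ParseError as err:
        print('rejected:', err)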


You're quite right that it looks like an odd design. It only works because the XML declaration uses only ASCII characters, and nearly all encodings are supersets of ASCII. If you're prepared to accept an encoding that isn't, for example EBCDIC, you can check whether the file starts with whatever the EBCDIC representation of "<?xml" is. That means you're relying on the general redundancy in the header of the file, rather than purely on the encoding attribute itself. Like many things in XML, it's pragmatic and works, but isn't particularly elegant.
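For instance, in Python (using cp037 as a representative EBCDIC code page; the helper name is mine):

    # "<?xml" in EBCDIC is the byte sequence 4C 6F A7 94 93.
    assert '<?xml'.encode('cp037') == b'\x4c\x6f\xa7\x94\x93'

    def looks_like_ebcdic_xml(data: bytes) -> bool:
        return data.startswith(b'\x4c\x6f\xa7\x94\x93')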

  • <?xml in EBCDIC is 4C 6F A7 94 93. However, not all EBCDIC code pages encode " the same way: code page 1026 uses FC while most others use 7F, so you'd have to look for both.
    – dan04
    Commented Mar 2, 2011 at 14:13

XML parsers are only required to support UTF-8 and UTF-16. A parser starts by trying the encoding indicated by a Byte Order Mark (BOM), if one is present (UTF-16, UTF-32 and even UTF-8 with its dummy BOM all have one). If none is found, it tries the byte patterns of UTF-32, UTF-16, UTF-8, ASCII and other ASCII-compatible single-byte encodings. Only then can it read the encoding attribute, and it will restart parsing with the declared encoding if necessary.
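A rough Python sketch of that first, BOM-based stage (the function name is mine; note that the UTF-32 BOMs must be checked before UTF-16, since FF FE is a prefix of FF FE 00 00):

    import codecs

    def sniff_bom(data: bytes):
        for bom, name in (
            (codecs.BOM_UTF32_BE, 'utf-32-be'),
            (codecs.BOM_UTF32_LE, 'utf-32-le'),
            (codecs.BOM_UTF16_BE, 'utf-16-be'),
            (codecs.BOM_UTF16_LE, 'utf-16-le'),
            (codecs.BOM_UTF8,     'utf-8'),  # the "dummy" UTF-8 BOM
        ):
            if data.startswith(bom):
                return name
        return None  # no BOM: fall back to the "<?xml" byte patterns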

  • I'd +1 if you could cite your sources.
    – Jason S
    Commented Feb 21, 2012 at 15:08

I think in principle you have a point that the encoding statement comes 'late' in the file; however, the whole first line only uses basic characters. AFAIK, those are encoded the same way in almost all encodings, so whatever you decode it as, it'll read <?xml ... ?> anyway.

Whatever comes after that, however, could matter. For example, text in a CDATA section could be encoded in a Cyrillic encoding.
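A contrived Python illustration of that: the declaration decodes as ASCII regardless, but the body is unreadable until you apply the declared encoding:

    data = '<?xml version="1.0" encoding="koi8-r"?><a>привет</a>'.encode('koi8-r')
    decl_end = data.index(b'?>') + 2
    print(data[:decl_end].decode('ascii'))  # the declaration reads fine as ASCII
    print(data.decode('koi8-r'))            # the body needs the declared encoding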

  • That's not entirely true: UTF-16 (LE/BE), UCS-2, UCS-4 and EBCDIC are all legal encodings that don't encode those basic characters the same way as ASCII. However, the algorithm described in the XML spec gives good instructions on how to find out which encoding family is used.
    Commented Mar 2, 2011 at 9:17
