
Looking at the XML declaration:

<?xml version="1.0" encoding="UTF-16" standalone="no"?>

Am I right to state that the encoding attribute is

  • coming too late (you can't read it properly unless you already know the encoding...)
  • redundant, hence error-prone: it's all too easy to replace it with "Big5" yet save the file in UTF-8

Or is that attribute not about the content of the stream?

Am I mixing up things here?
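For concreteness, here is the second bullet as a minimal Python sketch (the filename is just for illustration); nothing at write time complains:

    text = '<?xml version="1.0" encoding="Big5"?><a>héllo</a>'
    with open('mislabeled.xml', 'wb') as f:
        f.write(text.encode('utf-8'))  # labeled Big5, saved as UTF-8: no error here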


4 Answers


As you mentioned, you'd have to know the encoding of the file to read the encoding attribute.

However, there is a heuristic that can easily get you close enough to the "real" encoding to let you read the encoding attribute. This works because the <?xml part, by definition, can only contain characters in the ASCII range (however they are encoded).

The XML standard even describes the exact process used to find out the encoding (Appendix F, "Autodetection of Character Encodings").
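To make that concrete, here's a rough Python sketch of the no-BOM part of that detection process; the function name and the returned labels are my own:

    def sniff_xml_family(data: bytes) -> str:
        # The first four bytes must be some encoding of "<?xml" (the BOM
        # case is handled separately), so they reveal the encoding family.
        prefix = data[:4]
        if prefix == b'\x00\x00\x00\x3c':
            return 'UCS-4, big-endian'
        if prefix == b'\x3c\x00\x00\x00':
            return 'UCS-4, little-endian'
        if prefix == b'\x00\x3c\x00\x3f':
            return 'UTF-16BE or a similar 16-bit encoding'
        if prefix == b'\x3c\x00\x3f\x00':
            return 'UTF-16LE or a similar 16-bit encoding'
        if prefix == b'\x3c\x3f\x78\x6d':  # "<?xm" in ASCII
            return 'ASCII-compatible; read the encoding attribute to narrow it down'
        if prefix == b'\x4c\x6f\xa7\x94':  # "<?xm" in EBCDIC
            return 'EBCDIC; read the encoding attribute for the code page'
        return 'no declaration; the spec says to assume UTF-8'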

And the encoding label isn't redundant either. For example, if you use the algorithm in the XML spec and find out that some ASCII-based (or ASCII-compatible) encoding is used, you still need to read the encoding attribute to find out which one is actually used (valid candidates would be ASCII, UTF-8, any of the ISO-8859-* encodings, any of the Windows-* encodings, KOI8-R and many, many others). For the <?xml part itself it won't make a difference which one it is, but for the rest of the document it can make a huge difference.
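In code, that second step might look like this; a rough sketch, assuming Python, with the regex and the 100-byte window being my own simplifications:

    import re

    def declared_encoding(data: bytes, default: str = 'utf-8') -> str:
        # The declaration itself may only contain ASCII characters, so the
        # part we care about can safely be decoded as ASCII.
        head = data[:100].decode('ascii', errors='replace')
        match = re.search(r'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']', head)
        return match.group(1) if match else default

    data = '<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9</a>'.encode('iso-8859-1')
    text = data.decode(declared_encoding(data))  # now the whole document is readable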

Regarding mis-labeled XML files: yes, it's easy to produce those. However, the XML spec clearly specifies that such files are malformed and as such are not correct XML. Incorrect encodings must be reported as a fatal error (as long as they can be detected!). So it's the problem of whoever is producing the XML.
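You can see a parser enforce this; a minimal demonstration with Python's (expat-based) standard library, where the exact error message will vary by parser:

    import xml.etree.ElementTree as ET

    # Saved as UTF-8 but labeled US-ASCII: the first non-ASCII byte makes
    # the document not well-formed, and the parser must report it.
    bad = '<?xml version="1.0" encoding="us-ascii"?><a>é</a>'.encode('utf-8')
    try:
        ET.fromstring(bad)
    except ET.ParseError as err:
        print('rejected:', err)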


You're quite right that it looks like an odd design. It only works because the XML declaration uses only ASCII characters, and nearly all encodings are supersets of ASCII. If you're prepared to accept an encoding that isn't, for example EBCDIC, you can check whether the file starts with whatever the EBCDIC representation of "<?xml" is. That means you're relying on the general redundancy in the header of the file, rather than purely on the encoding attribute itself. Like many things in XML, it's pragmatic and works, but isn't particularly elegant.
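For instance, in Python (using cp037 as a representative EBCDIC code page; the helper name is mine):

    # "<?xml" in EBCDIC is the byte sequence 4C 6F A7 94 93.
    assert '<?xml'.encode('cp037') == b'\x4c\x6f\xa7\x94\x93'

    def looks_like_ebcdic_xml(data: bytes) -> bool:
        return data.startswith(b'\x4c\x6f\xa7\x94\x93')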

  • <?xml in EBCDIC is 4C 6F A7 94 93. However, not all EBCDIC code pages encode " the same way: code page 1026 uses FC while most others use 7F, so you'd have to look for both.
    – dan04
    Commented Mar 2, 2011 at 14:13

XML parsers are only required to support UTF-8 and UTF-16. A parser starts by trying the encoding indicated by a Byte Order Mark (BOM), if one is present (UTF-16, UTF-32 and even UTF-8 with its dummy BOM all have one). If none is found, it tries the byte patterns of UTF-32, UTF-16, UTF-8, ASCII and other ASCII-compatible single-byte encodings. Only then can it read the encoding attribute, and it will restart parsing with the declared encoding if necessary.
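A rough Python sketch of that first, BOM-based stage (the function name is mine; note that the UTF-32 BOMs must be checked before UTF-16, since FF FE is a prefix of FF FE 00 00):

    import codecs

    def sniff_bom(data: bytes):
        for bom, name in (
            (codecs.BOM_UTF32_BE, 'utf-32-be'),
            (codecs.BOM_UTF32_LE, 'utf-32-le'),
            (codecs.BOM_UTF16_BE, 'utf-16-be'),
            (codecs.BOM_UTF16_LE, 'utf-16-le'),
            (codecs.BOM_UTF8,     'utf-8'),  # the "dummy" UTF-8 BOM
        ):
            if data.startswith(bom):
                return name
        return None  # no BOM: fall back to the "<?xml" byte patterns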

  • I'd +1 if you could cite your sources.
    – Jason S
    Commented Feb 21, 2012 at 15:08

I think in principle you have a point that the encoding statement comes 'late' in the file; however, the whole first line only uses basic characters. AFAIK, those are encoded the same way in almost all encodings, so whatever you decode it as, it'll read <?xml ... ?> anyway.

Whatever comes after that, however, could matter. For example, text in a CDATA section could be encoded in a Cyrillic encoding.
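A contrived Python illustration of that: the declaration decodes as ASCII regardless, but the body is unreadable until you apply the declared encoding:

    data = '<?xml version="1.0" encoding="koi8-r"?><a>привет</a>'.encode('koi8-r')
    decl_end = data.index(b'?>') + 2
    print(data[:decl_end].decode('ascii'))  # the declaration reads fine as ASCII
    print(data.decode('koi8-r'))            # the body needs the declared encoding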

  • That's not entirely true: UTF-16 (LE/BE), UCS-2, UCS-4 and EBCDIC are all legal encodings that don't encode those basic characters the same way as ASCII. However, the algorithm described in the XML spec gives good instructions on how to find out which encoding family is used.
    Commented Mar 2, 2011 at 9:17
