57

So, HTML5 is the Big Step Forward, I'm told. The last step forward we took that I'm aware of was the introduction of XHTML. The advantages were obvious: simplicity, strictness, the ability to use standard XML parsers and generators to work with web pages, and so on.

How strange and frustrating, then, that HTML5 rolls all that back: once again we're working with a non-standard syntax; once again, we have to deal with historical baggage and parsing complexity; once again we can't use our standard XML libraries, parsers, generators, or transformers; and all the advantages introduced by XML (extensibility, namespaces, standardization, and so on), that the W3C spent a decade pushing for good reasons, are lost.

Fine, we have XHTML5, but it does not seem to have gained the popularity that the HTML syntax of HTML5 has. See this SO question, for example. Even the HTML5 specification says that HTML5, not XHTML5, "is the format suggested for most authors."

Do I have my facts wrong? Otherwise, why am I the only one that feels this way? Why are people choosing HTML5 over XHTML5?

6
  • 9
    +1 I see that I'm not the only one frustrated with the loss of all XML advantages in HTML5. Commented Jun 24, 2011 at 11:02
  • Honking good question, well put. Commented Jun 24, 2011 at 11:33
  • 1
    I hope I'm not the only one who's glad with the loss of all XML's disadvantages in HTML5. For example, let's compare valid HTML5 to valid XHTML. HTML5: <!DOCTYPE html>Hello World, XHTML: <?xml version="1.0" encoding="iso-8859-1"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><title></title></head><body>Hello World</body></html>
    – zzzzBov
    Commented Jun 24, 2011 at 20:49
  • @zzzzBov, you're most definitely not the only one who's glad, and that's why I asked this question in the first place. Also: you wouldn't seriously write <!DOCTYPE html>Hello World, would you? Try that on this validator. Commented Jun 27, 2011 at 14:49
  • 1
    @eegg, apparently you haven't read the spec on optional start tags, because I seriously would write <!DOCTYPE html>Hello World!, as it's perfectly valid HTML5. Shorter documents mean less bandwidth overhead, which equates to significant savings for large companies (have you seen what Google sends for www.google.com?).
    – zzzzBov
    Commented Jun 27, 2011 at 19:34

5 Answers

26

I would recommend reading How Did We Get Here?. Mark Pilgrim gives an excellent and brief history of HTML up to HTML5.

Essentially though, my understanding is that many webpages don't even take advantage of the "X" of XHTML because they don't specify the proper MIME type for it.
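To make the MIME type point concrete, here is a rough sketch of how a user agent picks a parser from the Content-Type header. It is a simplified model, not any browser's actual code: the function name `parser_for` and the set `XML_MEDIA_TYPES` are made up for illustration, and real browsers apply further rules such as content sniffing.

```python
# Hypothetical, simplified model of parser selection by media type.
XML_MEDIA_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def parser_for(content_type: str) -> str:
    """Return which parser family a page would get for a Content-Type value."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        # Anything served as text/html goes through the HTML parser,
        # even if the markup itself is written as XHTML.
        return "html"
    if media_type in XML_MEDIA_TYPES or media_type.endswith("+xml"):
        return "xml"
    return "other"
```

So an XHTML 1.0 page sent as `text/html; charset=utf-8` lands in the "html" branch, and none of the XML machinery is ever involved.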

5
  • 21
    Yeah. My summary of that story would be, "Hey, no-one's conforming to the specification. Maybe we could get them to conform to the specification by specifying that people can make any errors they want. Then finally all our documents will be error-free and standards-compliant." No good can come from writing a specification with the initial assumption that no-one respects specifications. Commented Jun 23, 2011 at 16:33
  • 1
    @eegg, your last line shows your ignorance of reality. Lots of good has already come from assuming nobody's perfect. Rather than the spec saying, "if you make any mistake, everything is broken" it instead says, "if you make [this type of mistake] then [this result] is what should happen". How many books would be on our shelves if they had to have 100% correct spelling, punctuation, and grammar for them to be published?
    – zzzzBov
    Commented Jun 24, 2011 at 20:48
  • 7
    @zzzzBov, your analogy with published books is weird. Why should an HTML parser be any more forgiving than a parser for [any other language here], where a syntax error is met with an error message? Imagine the chaos we would be in if our C compilers tried their best to silently reinterpret broken syntax. Commented Jun 27, 2011 at 14:44
  • @eegg, I can imagine what would happen if a parser for any other language reacted to syntax errors in a more forgiving manner: we would spend less time hunting down misplaced brackets and missing semicolons and more time typing functional code. I'm not saying that good programmers wouldn't still make their programs well-formed, but it would certainly help mediocre programmers write working code. A C program would probably end up looking much more similar to a Python program in that the semicolons and brackets could mostly disappear, and what would be left is the important code.
    – zzzzBov
    Commented Jun 27, 2011 at 19:17
  • “The requested resource /past.html is no longer available on this server and there is no forwarding address.”
    – Marco
    Commented Sep 16, 2012 at 16:40
7

If you produce XML-compatible HTML5 and send it with an XML MIME type, then the XML parser will be used and all that good jazz comes back ;)

EDIT: see this for more information: http://wiki.whatwg.org/wiki/HTML_vs._XHTML
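As a small illustration of what "comes back", the sketch below feeds an XML-compatible HTML5 document to Python's standard `xml.etree.ElementTree` parser and queries it by namespace. The sample document and the helper name `extract_title` are invented for this example; the only assumption about the markup is that every element is closed and the XHTML namespace is declared.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Illustrative document in XML-compatible HTML5 syntax:
# all elements are closed and the XHTML namespace is declared on <html>.
POLYGLOT_DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head><meta charset="utf-8"/><title>Hello</title></head>
<body><p>Hello <em>World</em></p><br/></body>
</html>"""


def extract_title(markup: str) -> str:
    """Parse the markup with a generic XML parser and return the <title> text."""
    root = ET.fromstring(markup)
    # Elements are addressed by namespace-qualified name, as in any XML document.
    title = root.find(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}title")
    return title.text if title is not None else ""
```

Nothing here knows anything about HTML: the same code would work on any namespaced XML vocabulary, which is the point being argued.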

3
  • Define "good jazz". AFAIK there's no advantage to parsing HTML as XML. Generating and transforming are other matters, those can be convenient, but the parsing by itself doesn't offer advantages, only downsides (it makes cosmetic bugs fatal). Commented Jun 24, 2011 at 10:13
  • 3
    @Joeri The fact that it’s vastly easier to parse is an advantage in my book, for a variety of reasons (strict parsing facilitates the finding of errors, better tool support because tools are easier to write, easier sanitising of input, etc.). Commented Jun 24, 2011 at 11:34
  • You can also provide some functionality unavailable in standard HTML, like mixing XHTML with other XML content, and generally use all XML features, namespaces for example. HTML parsers are able to fix bad source code (cosmetic bugs, as you call them), but those fixes have a price: the browser needs to know what it is likely to find in the code, which limits the available functionality.
    – deadalnix
    Commented Jun 24, 2011 at 12:26
3

HTML5 is the logical and inevitable conclusion of browsers adopting Postel's law ("Be liberal in what you accept").

Once one browser with sufficient market share adopts this principle, others are forced to follow suit, not only in being liberal by accepting non-conforming content, but also in rendering it the same way as their competitors do. HTML5 is the logical result of that situation: the browser vendors have decided that since they're not going to reject any content as invalid (at least, not at the HTML level; JavaScript is another matter!) they might as well sit round the table and agree on an interpretation for anything the content author might throw at them. In this environment, they haven't reacted kindly to standards-geeks telling them that if only they had rejected ill-formed content from the word go, they wouldn't have got into this mess.

So you and I can shout from the sidelines and tell the browser vendors and their users that the world would have been a better place if they hadn't believed Jon Postel, but the damage is done and it's very hard to undo it.

3
  • 5
    The story of browsers' competing sloppiness is true enough. But here's the thing: that's why the standards-geeks exist. If all browsers had enforced the straight and narrow from the start, organizations like the W3C wouldn't need to be here to keep things under control. The whole point of the standards is damage-control; for the standards body to give in and accept sloppiness defeats its very purpose. Commented Jun 24, 2011 at 9:53
  • 1
    @eegg: HTML5 redefines the parsing rules to make all input valid and still have predictable consequences. If syntax errors are impossible, a whole class of bugs are ruled out from the start. XML's ability to have parse errors is a design flaw, and should be recognized as such. Commented Jun 24, 2011 at 10:18
  • 2
    @Joeri, your position seems to be that of the HTML5 spec, taken to its insane logical conclusion. "HTML5 redefines the parsing rules to make all input valid" -- it doesn't. The concept of parsing errors still exists. "If syntax errors are impossible, a whole class of bugs are ruled out from the start" -- maybe this is parody? This logic is what I sarcastically paraphrased in my comment to @pthesis' answer. Yes, the class of syntax errors is removed, to be replaced by a larger class of browser syntax correction errors. Commented Jun 27, 2011 at 14:58
2

The HTML5 specification has actually been greatly improved over the HTML4 specification. In particular, the handling of error conditions and invalid markup is actually standardized, meaning all browsers that correctly implement the standard will handle invalid markup in the same way.

HTML is written by humans more often than not (usually in conjunction with some kind of templating language), and humans make mistakes. As long as all browsers handle syntax errors in the same way, then the "be liberal in what you accept" rule is perfectly acceptable.

There is really little advantage in producing valid XML, since tools and libraries to handle HTML are (nearly) just as readily available, and HTML is easier for humans to write than XML.
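The difference between rejecting malformed markup and recovering from it can be seen with two parsers from Python's standard library. This is only a rough sketch: `html.parser` is a lenient tokenizer and does not implement the HTML5 parsing algorithm from the spec, so it stands in for "a forgiving parser" in general. The names `TagCollector`, `xml_accepts` and `lenient_scan` are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BROKEN = "<p>Hello <b>World</p>"  # the <b> element is never closed


class TagCollector(HTMLParser):
    """Records start tags and text without ever raising on bad nesting."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def xml_accepts(markup: str) -> bool:
    """True if a strict XML parser considers the markup well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def lenient_scan(markup: str):
    """Return (start tags, concatenated text) as seen by the lenient parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags, "".join(collector.text)
```

With `BROKEN` as input, `xml_accepts` returns `False` while `lenient_scan` still reports the `p` and `b` tags and the text "Hello World". What HTML5 adds on top of this kind of leniency is a single specified recovery, so every conforming browser builds the same tree.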

1
  • Over the HTML4 specification, yes. But my point is that XHTML1.1 already improved on that. Tools/libraries to handle HTML tend to be like BeautifulSoup -- while wonderful tools, they should die along with the pages they were made to parse. Commented Jun 27, 2011 at 15:03
2

You will never get the benefits of a simpler parser or standard XML tools on the client side anyway.

There are billions of pages on the web in HTML, some of them written by people long dead, so they are never going to be updated to XML. So if you want to create a generally useful user agent you have to be able to parse old-fashioned HTML anyway. Arguably XHTML only introduces additional complexity since it requires a new mode of parsing in addition to the HTML parsing you already have to support.

On the server side you can still take advantage of XML tools, e.g. by generating XHTML using XSLT. But if you are not specifically using an XML toolchain, there is no benefit in using XML syntax rather than just HTML.
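For instance, here is a minimal sketch of server-side generation with a generic XML library. It uses Python's `xml.etree.ElementTree` rather than XSLT (which is not in the Python standard library), and the function name `build_page` and the page structure are invented for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_page(title: str, items: list) -> str:
    """Build a tiny XHTML page as a tree and serialise it to a string."""
    # Serialise the XHTML namespace as the default one (no prefix).
    ET.register_namespace("", XHTML_NS)

    html = ET.Element(f"{{{XHTML_NS}}}html")
    head = ET.SubElement(html, f"{{{XHTML_NS}}}head")
    ET.SubElement(head, f"{{{XHTML_NS}}}title").text = title

    body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
    ul = ET.SubElement(body, f"{{{XHTML_NS}}}ul")
    for item in items:
        ET.SubElement(ul, f"{{{XHTML_NS}}}li").text = item

    # The serialiser escapes &, < and > in text, so the output stays well-formed.
    return ET.tostring(html, encoding="unicode")
```

Because the page is built as a tree rather than by string concatenation, something like `build_page("A & B", ["x < y"])` comes out with the special characters escaped and can be fed straight back into any XML parser.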

(You are not correct that HTML is "non-standard" syntax. The syntax of HTML is specified in painstaking detail in the HTML5 spec, so it is just as much a standard as XML syntax.)
