24

I started my web design career back in the late 90s, and at that time web page technologies seemed to be in a state of flux. In particular, I remember a transitional period prior to widespread adoption of XHTML when some major web sites started deploying XML web pages that wouldn't render on people's machines due to poor browser support.

I'm familiar with the technologies involved, but what I would like to know is what factors led to the adoption, then abandonment, of XML and XSLT for web pages? Was this a case of the technology being too open-ended and too complicated for people to deal with, so that XHTML became a sensible compromise? XML is of course still ubiquitous, although I suspect that JSON has stolen some market share from it, but some of the issues that HTML5/CSS3 is trying to solve now seem to have been solved 25 years ago by XML/XSLT. I'm aware that HTML, XHTML and HTML5 share a clear lineage with XML and SGML and are directly derived from them, but I don't understand what the benefits were in losing the features that made XML and XSLT so capable.

I'll end my question with a quote from one of my favorite technology journalists, Shashank Sharma, who when challenged to find a solution to a particularly difficult problem replied:

I'm sure there's some way to do this using XML. Every problem in the world can be solved with XML.

22
  • 11
    Just a comment on "Everything can be solved in XML": I once ran into some test equipment that let you modify a data structure as a human-understandable tree with plaintext decoded properties before it was encoded into a binary structure for the tested equipment. The only way of exporting/saving data for later use was XML, but it simply contained the binary dump as a hexadecimal string. Which was useless, as just having the binary dump as a plain binary file would have been a far more useful format, not requiring conversions between the XML and any other format you needed.
    – Justme
    Commented Apr 28 at 15:27
  • 10
    Wow… XML/XSLT and XHTML. It's been nearly 20 years since I thought about any of that. I don't think XML lost out to XHTML, because in my experience with web services XML continues to carry on. Sure, JSON has taken a big bite out of XML's traction, but XML is still being used in legacy systems and I run into API endpoints that still use it. I think XHTML lost favour with front-end developers to HTML5 because XHTML didn't offer anything useful that HTML5 couldn't do, and HTML5 wasn't as strict. Full disclosure: I spent a lot of time studying the intricacies of XML/XSLT. I really enjoyed that. Commented Apr 28 at 16:15
  • 24
    Isn't XHTML just an application of XML? i.e., if a structure is XHTML then by definition it is also XML.
    – dave
    Commented Apr 28 at 16:37
  • 12
    My pet theory: XHTML lost to HTML5 mostly because (a) Internet Explorer did not support XHTML for a long time; (b) Firefox, which did support XHTML, did not support progressive rendering; (c) people were afraid of the Yellow Screen of Death, where a single syntax error could prevent a page from rendering at all, while there was no tooling to generate well-formed XHTML – HTML was generated by string concatenation without paying much attention to correct syntax, leaving it up to the browser to fix up errors. I am not sure I can substantiate much of this with sources, though. Commented Apr 28 at 19:59
  • 9
    What, no-one's mentioned ‘XML is like violence’?  “XML is like violence: if it doesn’t solve your problem, you aren’t using enough of it.”  — attrib. to ‘slashdot via rusty’.  “XML is like violence.  Sure, it seems like a quick and easy solution at first, but then it spirals out of control into utter chaos.” — Sarkos
    – gidds
    Commented Apr 29 at 15:25

11 Answers

44

XML didn't lose out to XHTML as you indicate in the question title; it lost out to JSON as the transport format.

That said, in order to be anything but structured data (which is what an XML payload is) it needs to be converted to something the browser can render. The intended technology for doing this was XSLT stylesheets, which can do some truly amazing things with said structured data (look up XPath axes), including converting it into an HTML document which the browser can then render.
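
For a flavour of what that looked like, here is a minimal XSLT 1.0 sketch; the <catalogue>/<book> vocabulary is invented purely for illustration, not taken from any real site:

<?xml version="1.0"?>
<!-- Minimal illustrative XSLT 1.0 stylesheet: turns a hypothetical <catalogue> of <book>
     elements into an ordinary HTML page that any browser can render. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <!-- The document root becomes a complete HTML page -->
  <xsl:template match="/catalogue">
    <html>
      <body>
        <h1>Catalogue</h1>
        <ul>
          <xsl:apply-templates select="book"/>
        </ul>
      </body>
    </html>
  </xsl:template>
  <!-- Each <book> becomes a list item; XPath axes allow far fancier selections than this -->
  <xsl:template match="book">
    <li><xsl:value-of select="title"/> (<xsl:value-of select="@year"/>)</li>
  </xsl:template>
</xsl:stylesheet>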

Unfortunately, at the time the focus on adhering to standards was not as high as it is today, and the browsers never got to a point where this just worked as well as shipping plain HTML in the first place.

XHTML in turn didn't do anything HTML couldn't do, except being valid XML which could be checked by a machine, so it fell into the "that's nice, but no real value" category. The effort went into JavaScript instead, because that was what enabled actual web apps as we know them today, instead of semi-static pages that you navigated through while the server generated what you had to see.

This was then formalized as HTML5.

So it was not so much a case of losing out as of never getting enough traction in the first place, because it didn't solve the problems that web developers had.

12
  • 4
    Your comment about "that's nice but no real value" is exactly what I was thinking. I think XHTML was never fully exploited, or (again) maybe it was too open-ended for people. I remember spending time learning about SGML, XML and XSLT and thinking that it was a really good way to do things, especially having all your content in one place and rendering it out to many different formats, while in the meantime modern browsers are still struggling to render simple web pages as printed documents :/ Commented Apr 28 at 18:53
  • 18
    Plus, it became a religion, to its detriment - preferring to go 110% XML even when it didn't make sense. The idea of having a functional programming language as a transform mechanism made a lot of sense. The idea that that functional programming language must itself be encoded as XML (i.e., XSLT) (with no defined "standard" syntax) was brain-dead stupid, and only someone who drank the kool-aid by the gallon would ever have thought of promoting it. (And by promoting it, of course, they closed off other avenues that could have been accepted to the benefit of the XML ecosystem.)
    – davidbak
    Commented Apr 28 at 19:21
  • 3
    @Nelson It has to do with fiendishly complicated HTML parsing rules, which are full of exceptions and implicit state transitions that make some DOM trees impossible to express as markup. In XHTML, every DOM tree has an unambiguous serialization, which is lossless (up to text node normalization). Round-tripping serialization through XHTML instead of HTML would have absolutely avoided the problem. Commented Apr 29 at 2:01
  • 2
    XML is a terrible serialization format because it's a markup language, not a serialization format. Commented Apr 30 at 16:02
  • 1
    @davidbak Before XSLT, there were multiple transformation languages for SGML and, by extension, for XML, including ones based on Scheme or Tcl, but none of them had the traction that XSLT had (and, obviously, none have had the longevity). XSLT, at that point in the hype curve/wave, got a boost from being a W3C effort, but there are advantages in being able to use the same editor for documents and styles, and when you've drunk enough Kool-Aid, it's convenient to be able to generate XSLT using XSLT. None of any individual's efforts at a compact syntax for XSLT have ever had any traction. Commented Apr 30 at 16:13
31

As someone who was also around at the time, my impression was that XML+XSLT and XHTML lost out to HTML5 for two reasons that are inherent in trying to replace HTML, the SGML dialect, with XHTML, a reformulation of HTML in the SGML subset known as XML:

  1. There's too much ecosystem built around assuming the robustness principle ("be conservative in what you do, be liberal in what you accept from others"), and people considered it far too much of a downside for far too little upside to serve XHTML with the MIME type that would actually trigger XHTML parsing, because that would mean a single un-sanitized or typo'd byte could result in the entire page being inaccessible, with the user staring at a parse error.

    If invalid RSS feeds were common enough that you needed something permissive like the Universal Feed Parser module for Python, what hope did strictness have in a language that people habitually templated with string processing?

    (During that period, I did actually look into templating engines that would enforce well-formedness at load time (Genshi being one I remember) as part of what would eventually develop into a love for Rust's compile-time strictness, but didn't find anything that wasn't at least an order of magnitude slower.)

  2. XHTML 2 tried to break compatibility with previous HTML and XHTML versions, with things like replacing <br> with <l>this is an explicit line</l>... something which the ecosystem at large rejected.

My impression at the time was that the development of HTML 5 and the introduction of the concept of HTML being a "living standard" was sort of akin to the migration from XFree86 to X.org, from OpenOffice.org to LibreOffice, or the blessing of EGCS as GCC 2.95... a vote of no confidence.

(Bear in mind that this also occurred back before Firefox switched to a fast release cadence after version 4, so there was an element of "slow-moving and out of touch development causing breakages" in people's minds, similar to how The GIMP is prone to being thought of to this day.)

XHTML and XML+XSLT came out of the same school of thought which was also enamoured with RDF and the Semantic Web. Those, in turn, were supplanted by the much more natural-for-the-web ecosystem of microformats that we've seen uptake on by consumers like Google Search.

8
  • 2
    It seems like there was/is still that mentality of non-compliant HTML being perfectly valid and the browser is supposed to make sense of complete garbage (such as browsers trying to parse two sets of HTML documents output to the same page). I find it odd that the browser developers were happy with that situation to continue indefinitely. The example you mention about <br> and (from my perspective) going from <br> to <br /> and back to <br> again seems to violate the whole concept of the markup system, like they don't know or can't figure out how to represent a line break in XML. Commented Apr 28 at 23:25
  • 27
    "I find it odd that the browser developers were happy with that situation" - they were not happy, but if their browser could not make sense of complete garbage, but other browser could, their browser will lose market share. Fortunately, after a while, they standardized the exact way of making sense of complete garbage, and everyone is happy now :)
    – artem
    Commented Apr 29 at 0:07
  • 3
    There was too much ecosystem built around assuming only the second part of the robustness fallacy, conveniently forgetting about the first part. Commented Apr 29 at 2:05
  • 1
    Also, microformats are pretty much dead these days. Tailwind, visual-only markup and obfuscated CSS class names, here we come! Commented Apr 29 at 2:06
  • 7
    You have to remember that the browser is an agent of its user, who is generally a person who simply wants to read a web page. Like all software we should expect it to attempt to do whatever its user wants. For an informational web page (although not necessarily a web app) what the user wants it to do is to make as much sense as it can of whatever is thrown at it. The reader would prefer to see something, even if partially garbled, than nothing. The chance that they see something that looks correct but is factually misleading due to this browser behavior is very low.
    – bdsl
    Commented Apr 29 at 12:55
27

XML was caught in the cross-fire of the browser wars.

There was a great bun-fight in the 2000s between the browser vendors and W3C over control of the HTML standard. I don't really know the political background to that, or why it was that the browser vendors sidelined W3C and went their own way, but XML was the victim - it was perceived by the browser vendors as a W3C technology that gave W3C control over the evolution of the web, and they didn't like that, so they sidelined it. With the browser being essentially a closed platform with little opportunity for independent 3rd party technologies, they effectively froze XML out.

Part of the bun-fight was a question of technical ideology: do you get a robust web by rejecting content that doesn't conform to standards, or do you get a robust web by accepting anything that anyone cares to produce? I don't think that was the whole story though.

Another part of the problem was the difficulty of producing a good XSLT processor for the browser environment, in particular one that fitted the browser/Javascript model of asynchronous resource access. The browser vendors were reluctant to make that investment, instead adopting a third-party processor (libxslt) that was never designed for that environment, and that was developed and maintained by the metaphorical "some guy in Nebraska", and presented an integration headache. For a long time during the 2000s you couldn't rely on the browser having an XSLT processor built-in at all, and without that, you couldn't deploy a web site that relied on client-side XSLT processing; and that in turn led to a lack of investment in the technology.

An alternative angle is that the browser wars in the 2000s were essentially a battle between Microsoft, who had dominant market share, and independent challengers such as Mozilla Firefox and Opera. Microsoft had invested heavily in XML technology, the other players had difficulty competing with that; so it was in the interests of the contenders to downplay the importance of XML.

And you can add to that the fact that XSLT is an extremely powerful technology but one with a sharp learning curve. A lot of people take one look at it and are frightened off. Few languages are loved more by their users or hated more by their non-users.

1
  • 10
    Just for the record: Michael Kay wrote the XSLT processor Saxon, which was the go-to implementation for me for almost a decade. He also made quite an effort to solve the scaling problem that XSLT quietly assumes you can hold the whole input document in memory (which for the Java implementation at the time required roughly ten times as much memory as the size of the file on disk), by handling the XML as a stream requiring much less memory - but that was only in the commercial version, which we - alas - didn't need badly enough to purchase. If anyone knows this bit of history, it's him 🙂 Commented Apr 29 at 20:52
11

To look at why these technologies might, or might not, be adopted, we need to think about who would be adopting them. Mostly that means browser developers and website creators/developers.

For XHTML, the main difference to classic HTML was strictness and most XHTML documents were also valid HTML. This made it simpler for browser developers, but they could not ignore the older plain HTML parsing code as it was still necessary to render any site that had not made the transition. For a website creator there was very little pressure to produce the site in XHTML as the main difference was that a browser could refuse to render an invalid XHTML document, whereas an HTML document with errors would probably still show the content, but would have style errors.

XML/XSLT is a slightly different story. The idea of using XML/XSLT was to produce web pages based on dynamic data. There were three main ways to do this:

  • Have the entire webpage created on the server programmatically writing HTML (possibly using some sort of template engine)
  • AJAX: have a fixed page, and then update the HTML in the browser after a separate data fetch
  • XML/XSLT: Fetch the data from an XML API, and then combine with an XSLT to generate an HTML page for rendering in the browser.

All of these have their own pros and cons, but for XML/XSLT there are many additional cons:

  • Significant extra work in the browser to support XML and XSLT
  • The website creator must work simultaneously in three separate languages (HTML, XML, and XSLT), all of which are superficially similar
  • The XSLT file contains the HTML markup, spread among the XSLT processing. This is very hard to read
  • A small mistake will lead to an invalid document, and no render
  • It is difficult to combine data from multiple sources on one page

So while XML/XSLT may have had some technical superiority, the usability factors were likely its downfall.

1
  • 2
    The design didn’t take into account that JavaScript would be leveraged to do all these things in the browser. Commented Apr 29 at 18:13
7

I'll add my thoughts on this, as they don't fit neatly into a comment. I was around through the same time and only have an outsider's perspective. Several of the comments already hit the overall issue on its head: "the user just wants to see content", and therefore breaking a bunch of existing sites is generally bad, although I think there is a bit more to it.

Early HTML was ugly, and obviously there is a lot of "ugly" structures still floating around. When people see something like that, they often want to clean it up. However, interoperability, backwards compatibility, and developer flexibility are all important. It's also important to remember the FOSS roots of many web tools.

XML - I recall a big push into this, mostly it seemed to be led by some standards bodies and vendors. The problem with XML is what to do with it - you can't present it to users "as is", so you need to transform it. XSLT was the proposed answer, but you either needed to render entirely on the server or have the browser try to render it (and early browser-based XSLT often relied on vendor-specific extensions; I recall a number of cases of mine that only worked in IE). As a developer my impression of XSLT was that it wasn't easy to work with (especially v1.0, which had a few gaps), and if run in the browser it could easily lead to a blank page, sometimes with little in the way of debugging support.

Then we get to XML Schema. I don't know percentages, but it feels to me that use has declined (this may be due to JSON, or might just be due to developer discontent). XML Schema was pretty obviously highly data-type driven, and seemed to fit well with a "tools first" approach. IMO, this ran into issues with an open web as the tooling was mostly proprietary - I know I was always working with tools that required a good bit of manual work to edit the schema. I think a huge amount of work by very smart people went into XML Schema, but it just ended up pushing a somewhat narrow and data-centric view. As a contrast, I briefly looked into RelaxNG schema - it felt much more natural, but at the time a lot of tools demanded a .xml extension, and RelaxNG declined to propose any type of common processing instruction or other in-file marker.

XHTML Obviously this is defined as XML, so saying it "beat" XML is a bit misleading. Although I think I understand, in that there were some proposals to serve raw XML directly to the browser (either loaded directly or via JS) and use XSLT to render something to the user, while XHTML attempted to redefine HTML from a looser SGML spec to a stricter XML-based spec. On its own I think this was a bit of a problem, e.g. browsers might need two rendering modes depending on the type, and at the time CPU and memory were much dearer than they are now; however, given time an XHTML standard might eventually have become the norm. However, a number of other attempts to standardize various aspects, such as XForms etc., were being introduced around the same time, and RDF was being pushed. At least to me this seemed not only like overkill (taking a fairly simple form and adding a large amount of definition just to get something simple) but likely an attempt to lock developers into expensive tooling to generate, edit, and debug. I'm not sure how others perceived this, but that was my impression.

HTML5 While the standard is essentially standalone, it's hard to imagine it without CSS3 and a modern DOM. It's imperfect, but has made backward compatibility a fairly central goal, as well as generally being IMO more tool-agnostic. JSON is somewhat obviously something of an organic development and as such has its own issues, but at least it's easy to author and template - and as validation of submitted data needs to happen on the server anyway, it's OK to not always push a schema out to the client.

So to wrap up - audience is important. The primary audience is the user, who just wants to navigate the web and doesn't care about format. The second is the content writers and web developers who are using the formats; if they don't perceive value relative to effort, then uptake is likely to be slow and other solutions will have a chance to take hold.

1
  • 1
    Some very good points there, particularly about the proprietary nature of the tools available and the complexity of XSLT. Commented Apr 29 at 16:48
7

I'd say the question is based on a slight misconception (well, two), and on memories clouded quite a bit by the marketing smoke made to calm popular opinion.

So let's have a look:

XML

The whole point of XML was to lay down the rules for how markup is to be structured. The goal was to allow the writing of extremely simple parsers for reading - and likewise simple standard tools for creation.

XML was never a definition of a web page markup language, nor did it even try to be. It's 100% structure and 0% semantics. In fact, XML on its own is not good for anything. What it is, is a rule book on how to create arbitrary special-purpose languages. It may be best viewed as a simplified and cut-down version of SGML, describing just the grammar but not the semantics of any language based thereon. Therefore it could not lose out to (X)HTML(5), as it was not intended or able to do so.

The main problem with XML is that it was never really intended to be used by application-level programmers, aka the coding guy, but it was presented to them without really pointing out the difference. So all they saw was stuff with no relation to their daily job. They work on a very non-abstract level. They do not want to know how they could create new elements for new languages. All they want is an element to make things blink, scroll, or change font and colour (*1).

Also, as a side note, XML isn't dead, it has just moved into the background where it belongs - although not everywhere. Some of the imagined savings did not come through - for example, browsers still need an additional generator/parser to produce JSON :(

XHTML

XHTML in turn wasn't anything really new, just HTML(4) with

  1. all crap dropped that was added over time and
  2. straightening the syntax to clean XML
  3. focus strictly on markup
  4. move all display generation into CSS
  5. restrict mix up of interactive elements (AKA JS)

Essentially it removed all the sloppy constructs that only worked because most browsers fixed them up or worked with implied assumptions, and made HTML about markup, nothing else.

Point 1 again created much aversion among application programmers, not just because their beloved <marquee> tag was gone, but also because it meant moving all presentation into new and very strange CSS files, or inline - which was perceived as bloaty and useless.

Similarly for point 2. After all, what good is it to change <BR> into <br/>? There was a general notion that all those changes were unnecessary and limiting. Pages were obviously working the way they were, so why change?

Not to mention that at that time many were still thinking in terms of hand created HTML and interaction using Perl.

Last but not least, #4 and #5. In the end, XHTML was - much like XML - a quite abstract, clean-sheet design, as so often asking way too much understanding from not-so-academic users. For a new design it's normal that not everything is supported from the start and will be added by subsequent components. But that's not what people at the front lines of web design need. They want everything they had, plus all their wishes for new elements and functions fulfilled. And that's where HTML5 came in.

HTML5

HTML5 comes from the same direction as XHTML - getting rid of old crap and straightening out structure and the separation of markup vs style vs interaction - but the WHATWG targeted all of that in a far more down-to-earth way: concrete elements and functions, allowing quirks to continue to work, and so on.

In addition, HTML5 offered strong support for new features by adding JS APIs, like Geolocation, Server-Sent Events and tons of other new hooks to allow much more interaction, all the way to drag & drop. Things that before had worked differently on each browser, if at all.

But most of all, HTML5 focused on selling it all as a continuation, a 'minor' change to HTML4, not an all-changing new thing one has to learn from scratch. (*2)

In the long run everything works as it was envisioned with XML - generated almost exclusively by tools, hiding a nature that is at least as bloaty.

XSLT etc.

XSLT in turn was a great idea for manipulating XML trees. And like XML, it's nothing most programmers ever see or handle directly; it's a special case for interfacing/document conversion. As a generalised tool it's of course way more powerful than what CSS can offer for adapting a specific document type to generate various output. But CSS, as a language specialised to one document type and a narrowly defined use case, requires less effort to get the same result. Kind of CISC vs. RISC.

I for one love to use XSLT to transform structured documents into every format they need to go - including feeding data created in web applications into mainframe applications that still think they are reading punch cards :))


*1 - Something Microsoft did understand quite well, making Internet Explorer very well received at some level.

*2 - It's the old story of people not liking fundamental change. They want more of the same. IMHO being a close to 100% clone of CP/M was a main reason why MS-DOS became the success we know. It might not have been received the same way if it had started out with the 2.0 file interface. Once people had moved to MS-DOS, using it the same way as CP/M before, introducing new features that replaced everything wasn't met with much resistance. Or let's say not much; I remember several comments about those useless new functions wasting time and the like :)

5
  • 2
    +1 for talking about the pure data side. I for one am just hearing about the idea of xml and xsl for generating web pages....... that does seem like a lot of work 😜 I remember xml as a brilliant data integration point for web- and web-service-centered applications, that must have lost out to json for speed back in that day
    – Mike M
    Commented Apr 30 at 2:21
  • 1
    I think the main reason JSON won over XML is that XML is wordier (you need a matching closing tag around everything) and hence also more fragile (because that closing tag could be wrong). This was seen as a feature by the "validate everything" crowd, but a bug by the "tolerate everything" crowd, and HTML (barring XHTML) has always been the latter. Even on the data side, where validation might be more beneficial, the wordiness was its downfall, both in fragility and larger documents.
    – Miral
    Commented Apr 30 at 6:02
  • You're missing the rear backtick on <br/> so it's going straight to the browser. I can't fix it because it's a 1-char change.
    – davolfman
    Commented Apr 30 at 23:05
  • @davolfman Oops. Thanks.
    – Raffzahn
    Commented Apr 30 at 23:48
  • 1
    @davolfman which is entirely appropriate in an answer discussing XML and HTML ;-) Commented May 1 at 8:12
5

I remember a transitional period prior to widespread adoption of XHTML when some major web sites started deploying XML web pages that wouldn't render on people's machines due to poor browser support.

What I think you're talking about is the concept that instead of producing [X]HTML, one would serve to browsers:

  • a semantic XML document, written with a custom set of elements representing the raw data you wanted to present;

  • an XSLT stylesheet that would transform this data into the presentational elements that make up a web page, linked from the XML using an <?xml-stylesheet?> processing instruction (a minimal sketch of this setup follows the list).

  • (There was also a variant of this vision where the target document produced by the transform would be XSL-FO. Yikes.)
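
To make that concrete, a minimal sketch of the first two pieces might have looked like the following; the <recipe> vocabulary and the render.xsl filename are invented for illustration. The <?xml-stylesheet?> processing instruction is what told the browser to fetch the transform and apply it before rendering anything:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="render.xsl"?>
<!-- A purely semantic, custom-vocabulary document; render.xsl would turn it into presentational HTML -->
<recipe>
  <title>Pancakes</title>
  <ingredient quantity="250" unit="g">flour</ingredient>
  <ingredient quantity="2">eggs</ingredient>
  <step>Mix everything and fry in a hot pan.</step>
</recipe>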

Some people were convinced that this was The Future of the Web, but judging by the comments here few remember it now! (I do because I tech-edited a book about stylesheets whose authors were sure this was inevitable, spent a disheartening proportion of the text coping with it, and couldn't be persuaded otherwise.)

As I recall it, the ‘adoption’ was largely limited to tech demos; I don't think any ‘major sites’ deployed it. This would partly have been due to browser support issues as mentioned, but also:

  • implementations weren't fast; and XSLT being a transform that re-orders the whole document, you lost progressive rendering;

  • beyond the problems of pages breaking due to XML well-formedness errors (which honestly in my view have been overstated), any error or browser compatibility problem in the XSLT stood a high chance of rendering the whole page immediately totally unusable;

  • the document model would be so different in this setup that a lot of copypasta CSS and especially JS that authors would have wanted to keep wouldn't work;

  • XSLT feels super-powerful and elegant when you start using it, but when you get down to the mundane job of producing web pages with it, the dull reality is it's just way less practical than bog-standard templating languages. Managing the ordering of all the matches in the page quickly becomes a chore and absent a real programming language behind it you end up with some truly gross XPath expressions to clumsily attempt string processing. Especially as browsers remained limited to XSLT 1.0.
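
To illustrate that last point with a small hypothetical fragment: XPath 1.0 has no lower-case() or replace() functions, so even a case-insensitive match is typically faked with translate():

<!-- XSLT 1.0 / XPath 1.0: faking a case-insensitive substring test; 'title' is a hypothetical element -->
<xsl:if test="contains(translate(title,
                                 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                                 'abcdefghijklmnopqrstuvwxyz'), 'xml')">
  <xsl:value-of select="title"/>
</xsl:if>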

But mostly: there was minimal real-world benefit in having the semantic source data and the web page be the same document. If you wanted to expose your data you would likely be just as happy providing it through an API and not having to tightly couple the API and UI.

Ultimately as with the Semantic Web in general, it's an attractive vision unencumbered with an answer to the question “what material benefit is there to the individual author of making this available?”.

5

Microsoft Internet Explorer 6 Didn’t Handle It Right, So No One Could

W3C wanted to start over with a more-structured markup language because HTML had turned into “tag soup.” That is, authors broke all the rules, and browsers just tried their best to display it anyway. People just wrote whatever looked right on their browsers and didn’t care if it was correct. This was terrible for anyone writing a new browser, because the spec had become useless. They had to handle all the broken pages out there exactly the same way as all the other browsers did. If you didn’t duplicate all the undocumented behavior of the other browsers out there, and do exactly the same thing as them for all the HTML that did everything wrong, your new browser wouldn’t be able to display most of the pages on the Web.

The way W3C was going to solve that was starting over with a new format that explicitly broke backward compatibility. Not only that, it would have the “death penalty” from the start: any compliant user agent must refuse to display an XHTML web page that contained any error. W3C also recommended that XHTML be served with the application/xhtml+xml MIME type.

Microsoft didn’t support any of that. It would accept and display XHTML, but only if it was served with the text/html MIME type. It would interpret it as just some strange dialect of HTML it didn’t recognize, and display it as tag soup, not following any standard. Every server worked around this by telling browsers that their XHTML was text/html, and the other browsers tried to display the pages that Internet Explorer could.

As a result, a large portion of the actual XHTML on the Web, even by professional web designers, failed to formally validate. Trying to enforce the spec at that point was a lost cause: Nobody would ever use a browser that couldn’t display a large share of the existing Web, even if some published standard said they were supposed to reject it.

At that point, whatever new features XHTML offered (with a few exceptions such as MathML) got added to a backward-compatible HTML 5, and there was no benefit to switching.

3

Around the time that the XML effort started, HTML was being pulled in multiple directions as different groups wanted to extend HTML in different ways (e.g., Wireless Markup Language: https://en.wikipedia.org/wiki/Wireless_Markup_Language). XML started as a 'put up or shut up' effort when some SGML-inclined people, principally Jon Bosak of Sun Microsystems, made the case for needing something SGML-like that was designed to be extensible.

The abstract for the XML Recommendation is:

The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.

Whether or not the wider W3C originally thought that the SGML Editorial Review Board (as it was) would produce anything useful, the W3C pivoted partway through the development of XML, and suddenly XML was everywhere.

In May 1997, Tim Bray reported (https://lists.w3.org/Archives/Public/w3c-sgml-wg/1997May/0079.html):

  1. We have a strong political reality to deal with here in that for the first time, the big browser manufacturers have noticed XML and have together made a strong request: that error-handling be completely deterministic, and that browsers not compete on the basis of excellence in handling mangled documents. It was observed that if they wanted to do this, they could just do it; but then pointed out that this is exactly why standards exist - to codify the desired practices shared between competitors. In any case, if we want XML to succeed on the Web, it will be difficult to throw the first serious request from M & N back in their face.

  2. In fact, everyone on the ERB substantially agrees with M&N's goal, in that we do not, ever, want an XML user-agent to encounter a WF error and proceed as though everything were OK. Our disagreements centre on how to use the spec machinery to achieve this.

The browser manufacturers may have meant HTML5-style "completely deterministic" error handling, where there's a path for every conceivable input token. While you can do that for a fixed tag set, you can't really do that when there is no fixed tag set, so XML went for 'draconian error handling', where there is no forgiving error recovery mechanism.

Browsers took the path of giving the YSOD for XML errors and, eventually, did 'just do it' and made their own standard for the deterministic error recovery that they wanted.

2

First, I think it's important to make a distinction between two different types of languages:

  • A markup language is for writing text documents with extra features, like font styles (bold/italic), hyperlinks, images, lists, or section headers.
  • A data serialization language is for representing data objects, which may contain numbers, strings, binary blobs, or other objects. Such a language may be represented as human-readable text for convenience, but the data itself need not conceptually be text.

So, I'll consider these separately.

XML as a data serialization language

XML can be used as a data serialization language, and is widely supported as one. Microsoft's .NET Framework provides built-in support for it, and its documentation provides an example.

<?xml version="1.0" encoding="utf-8"?>
<PurchaseOrder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.cpandl.com">
    <ShipTo Name="Teresa Atkinson">
        <Line1>1 Main St.</Line1>
        <City>AnyTown</City>
        <State>WA</State>
        <Zip>00000</Zip>
    </ShipTo>
    <OrderDate>Wednesday, June 27, 2001</OrderDate>
    <Items>
        <OrderedItem>
            <ItemName>Widget S</ItemName>
            <Description>Small widget</Description>
            <UnitPrice>5.23</UnitPrice>
            <Quantity>3</Quantity>
            <LineTotal>15.69</LineTotal>
        </OrderedItem>
    </Items>
    <SubTotal>15.69</SubTotal>
    <ShipCost>12.51</ShipCost>
    <TotalCost>28.2</TotalCost>
</PurchaseOrder>

While this works, the format is pretty verbose. Note that every element name is written twice: For the open tag <X> and the close tag </X>. There are three namespace URIs specified, even though in practice most XML-handling code doesn't need to handle element name conflicts.

Also, why is ShipTo.Name an XML attribute instead of an element? Such a distinction makes sense in a markup language like HTML, which often (but not always) follows the idea that element contents are user-visible while attributes are markup metadata, as in <a href="$URI">$TEXT</a>. But in a data serialization language, having two ways of specifying object fields just seems redundant.
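
For instance, both of the following fragments carry exactly the same information, and XML itself gives no guidance on which form a producer should emit or a consumer should expect (the second form is a hypothetical rewrite of the example above):

<!-- Attribute form, as in the .NET example above -->
<ShipTo Name="Teresa Atkinson"> ... </ShipTo>

<!-- Element form, equally valid XML -->
<ShipTo>
    <Name>Teresa Atkinson</Name>
    ...
</ShipTo>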

Compare with the equivalent JSON literal:

{
    "ShipTo": {
        "Name": "Teresa Atkinson",
        "Line1": "1 Main St.",
        "City": "AnyTown",
        "State": "WA",
        "Zip": "00000"
    },
    "OrderDate": "Wednesday, June 27, 2001",
    "Items": {
        "OrderedItem": {
            "ItemName": "Widget S",
            "Description": "Small widget",
            "UnitPrice": 5.23,
            "Quantity": 3,
            "LineTotal": 15.69
        }
    },
    "SubTotal": 15.69,
    "ShipCost": 12.51,
    "TotalCost": 28.2
}
  • It's 36% shorter than the XML document.
  • There's a syntactic distinction between string literals (with quotation marks), and numeric literals (without quotes). XML is stringly-typed, and thus lacks a convenient way to distinguish the number 123 from the string "123", or a null value from the empty string "".
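
For example, without a schema in hand, a consumer of the purchase order above cannot tell whether the following (partly hypothetical) fragment carries the number 3 or the string "3", or whether the empty element means null or an empty string:

<Quantity>3</Quantity>
<GiftMessage></GiftMessage>

In the JSON version, 3, "3", null and "" are four syntactically distinct values.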

For reasons like these, many developers now prefer to use JSON or YAML for data serialization. After all, they were designed for this purpose. XML was not, as evidenced by the fact that it's named XML (eXtensible Markup Language) instead of something like XDSL (eXtensible Data Serialization Language).

XML as a markup language

There is of course, an XML-based markup language for the Web, named XHTML. So the question here is: Why did HTML5 win out over XHTML?

XHTML's strictness did more harm than good

XHTML requires well-formed XML syntax, which means stricter parsing rules and less tolerance for errors. In theory, this was a feature, in that it allowed parser implementations to be simpler, so that they wouldn't have to work around traditional HTML's infamous "tag soup".

In practice, there were a lot of HTML-generating tools that used simple string concatenation instead of a proper HTML-building library, and didn't pay attention to generating "well-formed" code. Developers generally didn't like seeing their pages fail to render because of a syntax error, and preferred HTML's more forgiving approach. Postel's Law applies.
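
A small fragment shows the gap in practice. An HTML parser quietly repairs every problem below; an XML parser is required to stop at the first one, so the same markup served as application/xhtml+xml would never reach the screen (the markup is invented for illustration):

<!-- Unclosed <li>, an unquoted attribute value, a bare ampersand, a void <br> with no slash -->
<ul>
  <li>Fish & chips
  <li><a href=menu.html>See the menu<br>
</ul>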

At least HTML5 made an attempt to standardize error handling, so that "sloppy" HTML code will be handled more consistently.

Backward Compatibility

Closely related to the above, HTML5 was designed to be backward compatible with older versions of HTML, making it easier for existing websites to transition to HTML5 without significant changes. Other than declaring the document as <!DOCTYPE html> to mark it as HTML5, you might not even have to modify your HTML4 documents at all.
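
As a sketch of how small that migration could be (assuming the page relies on nothing that HTML5 dropped, such as framesets), often only the doctype needed to change:

<!-- HTML 4.01 Transitional doctype -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">

<!-- HTML5 doctype -->
<!DOCTYPE html>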

OTOH, XHTML's stricter parsing model made it less compatible with existing HTML documents.

Browser support

Microsoft Internet Explorer simply didn't support XHTML files until version 9 in 2010.

Now, there was a workaround: By using XHTML's Appendix C compatibility guidelines to smooth over the syntactic differences between HTML4 and XHTML, and having User-Agent-conditional logic to set the Content-Type to text/html on Internet Explorer but application/xhtml+xml on better browsers, you could have your XHTML files work on IE.

But, it was simpler to just use text/html for all browsers, negating any perceived advantage you may have gotten from XHTML.

And, as @user3840170 pointed out in a comment, doing so would have made the page load faster on Firefox, which supported progressive rendering for HTML but not for XHTML.

0

Probably all said before, but here is my impression of things:

  • SGML was incredibly cool (like implicit end tags), but expensive on CPU and memory cycles, so the initial HTML (up to version 3.2 probably) was kind of "Lightweight SGML".
  • While SGML's idea was a content model and strictly semantic "tagging", people using HTML used tags to cause specific rendering, not to mark specific elements (like "Let's start with <h3>, because <h1> is too big for my taste"). The HTML at that time was very much "tag soup" (just an arbitrary collection of (mostly) balanced start and end tags).
  • My guess is that people wanting more structure, going back to the idea of SGML, invented XHTML (in the meantime XML had been invented as a subset of SGML, intended for more painless use), hoping to bring proper structure back into the "tag soup". Actually HTML 4.0 or later already went in that direction. I also think XHTML was intended to be styled heavily by CSS, while one could still rely on the default styles (as in classic HTML).
  • HTML5 is typically combined with "heavy scripting" (and all kinds of nonsense, IMHO) where you can assign any semantics to almost any element. This involved the DOM (Document Object Model), allowing the whole "document" to be read and written by scripts (sometimes causing terrible performance, too). Any text could be a button (for example), and you can mis-use elements so heavily that browsers and their tools get confused ("Why can't my password manager be used with site XY?"). XSLT is another technology defined to be applied to properly structured documents (browser implementations still vary). Maybe HTML5 could be summarized with "tags lost their meaning", because you can add any tag name you like and attach runtime semantics to it via scripting (you know those fun pages, where a button "flees" from your mouse pointer?).

While back in the HTML times one could view and use a web page without any scripting support, most web pages presented today cannot even be viewed without scripting support.

A final opinion on XML: I think XML is mis-used in too many cases as a mere syntax for representing tree-like structures, without specifying the structure of the elements or their contents (if interested, here is a nice example of how not to use XML).

