  1. I was wondering what "profile" means in Wikipedia:

    XML is a profile of an ISO standard SGML, and most of XML comes from SGML unchanged.

  2. According to http://xml-tips.assistprogramming.com/sgml-xml-html-xhtml-all-together.html:

    HTML is a subset of SGML.

    XML is a highly functional subset of SGML.

    XHTML extends and subsets HTML.

    Does "one being a subset of another" mean that code in the first is also syntactically correct and semantically the same as in the second?

    As in the sense of elementary set theory,

    • are HTML, XML and XHTML all different subsets of SGML?
    • do XML and HTML almost not intersect each other?
    • is XHTML a superset of both XML and HTML?
  3. Can I expect a more concise and clear summary of the differences in the purposes of the four, and/or when to use which, than the link above? I am really confused about where the line between their intended purposes lies.
  4. According to http://xml-tips.assistprogramming.com/sgml-xml-html-xhtml-all-together.html:

    XML is not a single Markup Language. It is a metalanguage to let users design their own markup language.

    How should I understand that XML and HTML are both subsets of SGML, yet HTML is a markup language while XML is not a markup language but a metalanguage for designing markup languages?

    Are SGML and XHTML both also metalanguages for designing markup languages?

  5. Both links mention that HTML is an application of SGML as well as a subset of SGML, and that XHTML is an application of XML. What is the difference between saying one language is an application of another and saying one language is a subset of another?

4 Answers

9

HTML and XML are both markup languages (hence the *ML). XML is a generic markup language suitable for representing arbitrary data, while HTML is a specific markup language suitable only for representing web pages.

HTML and XHTML are both subsets of SGML only, except that XHTML carries additional rules so that it also validates as XML. Think of XML as XHTML's influential godfather.

Because of this relationship to SGML across all 3 of these languages, there are a lot of similarities, but they are all considered different languages. However, much of what defines these languages is their restrictions on SGML.

  • HTML restricts SGML by defining a list of tags that are allowed to be used.
  • XML restricts SGML by not allowing unclosed or empty start and end tags, and by forcing attributes to be explicit. XML also has a large number of additional restrictions that are not found in SGML.
  • XHTML restricts SGML with the tags from HTML (with some exclusions, such as frameset and others), and with the tag and entity restrictions from XML.
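
To make the XML bullet concrete, here is a minimal sketch in Python using only the standard library. The fragment and the helper names are made up for illustration, and `html.parser` is just a lenient tokenizer, not a full browser-grade HTML parser. The same tag soup that an HTML tokenizer accepts is rejected by an XML parser:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment: <p> and <br> are never closed, and "checked"
# is a minimized attribute. HTML tolerates all of this; XML does not.
FRAGMENT = '<div><p>First<br><input type="checkbox" checked></div>'

# The same content rewritten to follow XML's rules.
STRICT = '<div><p>First<br/><input type="checkbox" checked="checked"/></p></div>'


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML tokenizer sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def is_well_formed_xml(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def html_start_tags(text):
    """Return the start tags found by Python's HTML tokenizer."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags
```

Calling `is_well_formed_xml(FRAGMENT)` gives `False`, while `html_start_tags(FRAGMENT)` still walks through all four tags. The `STRICT` version passes both.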

You may find this document helpful, although the technical terms may be hard to digest. http://www.w3.org/TR/NOTE-sgml-xml-971215

XML is not a metalanguage for defining markup languages. Really that's just SGML. XML is simply a data formatting markup language. Your quoted source is using technical terms imprecisely, which is why they are confusing.

Purposes

XML is for defining your own data format. If you wish to pass data between two systems, XML is often the way to do it.

If, for example, you needed to pass a sales order from your website to your billing system, you could create this XML payload:

<order id="12345">
    <name>John Doe</name>
    <item id="443">Adult Diapers</item>
</order>

Your website would then send that XML to your billing system, which could then parse the data from that XML.
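
As a rough sketch of the receiving side, assuming the billing system is written in Python and uses the standard library's `xml.etree.ElementTree` (the function name `parse_order` and the shape of the returned dictionary are my own invention, not part of any real billing API):

```python
import xml.etree.ElementTree as ET

# The payload from above, exactly as the website would send it.
ORDER_XML = """<order id="12345">
    <name>John Doe</name>
    <item id="443">Adult Diapers</item>
</order>"""


def parse_order(xml_text):
    """Turn the order XML into a plain dictionary (hypothetical shape)."""
    root = ET.fromstring(xml_text)
    return {
        "order_id": root.get("id"),          # attribute on <order>
        "customer": root.findtext("name"),   # text of the <name> child
        "items": [
            {"id": item.get("id"), "description": item.text}
            for item in root.findall("item")
        ],
    }
```

Note that everything comes back as strings; XML itself carries no type information, so converting `"12345"` to a number is up to the receiving system.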

XHTML and HTML are obviously just for web pages. XHTML's primary purpose is to remove a lot of the ambiguity that we had in previous years (decades) of web development. Back in the late 90s when I started, we were using HTML 3.2 which allowed for seriously sloppy code. HTML 4+ and XHTML try to remedy that by either strongly suggesting or enforcing explicit closing tags, explicit attributes, and disallowed tags, which makes it easier on both browsers and humans, and avoids unexpected differences in behaviour cross-browser.

7
  • Thanks! (1) Are both HTML and XML subsets of XHTML? (2) Is it correct that neither HTML is a subset of XML, nor XML is a subset of HTML? Do HTML and XML have nonempty intersection, or totally separated from each other?
    – Tim
    Commented Jul 16, 2011 at 10:25
  • (3) What differences are between saying one language is an application of another, and one language is a subset of another?
    – Tim
    Commented Jul 16, 2011 at 10:40
  • There are documents that conform with both XML and HTML; there are documents that conform with XML and not HTML, and there are documents that conform with HTML and not XML. So neither is a subset of the other, but they have a non-empty intersection. Commented Jul 16, 2011 at 12:11
  • @Tim: (1) HTML, XML, and XHTML are not subsets of anything except SGML. They are all different. XML actually has just about nothing to do with HTML or XHTML...it serves a different purpose. XHTML can be parsed as both HTML and XML, but it's used only by browsers as HTML markup. HTML and XML both have a common ancestor of SGML, but are otherwise unrelated. For every intent, they are separate because SGML is so generic.
    – Jordan
    Commented Jul 16, 2011 at 17:27
  • Honestly I think you're diving too deeply into terminology with application vs subset. I don't think there's a distinction between those terms, or if there is, I doubt it's widely agreed on. Suffice it to say that XHTML borrows concepts from XML and is used as a strict subset of HTML. HTML came first. XHTML came afterwards.
    – Jordan
    Commented Jul 16, 2011 at 17:29
6

I'm going to start by saying that XML is a subset of SGML, then XHTML is a subset of XML.

HTML is based on SGML but with some different rules. XHTML is basically an updated version of HTML with some rules put in place so that it is also correct XML.

Here are some notes on how the HTML 5 standard works with other specifications: http://dev.w3.org/html5/spec/Overview.html#compliance-with-other-specifications

I'm not sure of the differences between SGML and XML or when you would use one over the other, although XML seems to be the more commonly used one.

For XHTML and HTML you are probably better off always using XHTML. Errors are easier to find and as a bonus it will also be valid XML.
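
To illustrate the "errors are easier to find" point, here is a small sketch, assuming you run your XHTML through an XML parser (Python's standard `xml.etree.ElementTree` here; the sample page and the helper name `first_xml_error` are made up). The parser stops at the first well-formedness error and reports where it is:

```python
import xml.etree.ElementTree as ET

# Hypothetical page with a <p> that is never closed.
BROKEN_XHTML = """<html>
  <body>
    <p>Unclosed paragraph
  </body>
</html>"""


def first_xml_error(text):
    """Return (line, column) of the first well-formedness error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple.
        return err.position
    return None
```

For the page above, the reported line is the one holding `</body>`, which is where the parser notices the `<p>` was never closed. An HTML parser would silently recover instead.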

3
  • Thanks! (1) I was wondering how to understand the two seemingly conflicting facts: XML and HTML are both subsets of SGML, and HTML is a markup language while XML is not a markup language but a metalanguage for designing markup languages? (2) According to your reply, XHTML is a subset of XML. XHTML is a superset of HTML as "XHTML subsets HTML" quoted from one link in my post. So HTML is a subset of XML? I am not sure it is true.
    – Tim
    Commented Jul 16, 2011 at 2:46
  • HTML breaks too many rules to be XML. HTML is closer to SGML, I believe. HTML is loose with tags and there is a set number of different tag types. XHTML is just the XML version of HTML.
    – WalterJ89
    Commented Jul 16, 2011 at 8:04
  • Thanks! Both links mention that HTML is an application of SGML as well as a subset of SGML, and that XHTML is an application of XML. What is the difference between saying one language is an application of another and saying one language is a subset of another?
    – Tim
    Commented Jul 16, 2011 at 10:42
2

The history of these might enlighten you here. Simply talking about meta-languages, profiles, subsets and instances is a little dry! I'll try to keep it short and simple.

SGML evolved from GML (Generalized Markup Language) which was devised by 3 IBM engineers in the 1960s as a means of storing elaborate legal, government, industrial and military documents. GML was gradually refined till it was standardized as SGML in 1986.

GML/SGML is not a language per se. It is rather a meta-language, i.e. a language to define conforming languages or the "rules" by which formatting of a variety of elaborate documents could be designed in a generally consistent way. Each different type of document would therefore define its own SGML conforming set of tag names plus associated attributes, as well as any defined formal public identifiers/namespaces, schemas, etc. Each format defined like that became therefore a distinct data storage language for the document type concerned. Because of the consistency between all documents conforming to SGML rules, it is possible to write code to collate/process data within these documents and transfer data between documents sharing a common format.

SGML was found overly elaborate for the many smaller documents in circulation. So XML was developed, starting in 1996 and first published as a W3C Recommendation in 1998, as a subset (the word "profile" effectively means the same as subset) of SGML that could handle both small and large documents. Being a subset of a meta-language, XML is itself a meta-language, though a simpler one. You could say that XML provides a basis for designing document formats suitable for both easy storage and transfer between systems on a network.
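
Here is a minimal sketch of what "meta-language" buys you in practice, using Python's standard `xml.etree.ElementTree`. The two vocabularies (`invoice` and `recipe`) and the function name are invented for the example. Because both documents follow the same XML rules, one generic function can process either without knowing their tag names in advance:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two made-up markup languages, each with its own tag names.
INVOICE = '<invoice><line qty="2">Pens</line><line qty="1">Ink</line></invoice>'
RECIPE = "<recipe><title>Tea</title><step>Boil water</step><step>Steep</step></recipe>"


def element_counts(xml_text):
    """Count how often each element name occurs, for any XML vocabulary."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(element.tag for element in root.iter())
```

The function never mentions `line` or `step`; those names belong to the individual languages, while the parsing rules belong to XML.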

After the standardization of SGML but before it was simplified to XML, the internet emerged and with it a need for a document format that enabled easy transfer and display of both documents and loose data. The result was the HTML language, an instance (occasionally referred to as an application) of SGML with 18 pre-defined tags providing a standardized way to display a variety of data types, e.g. text, images, audio, etc. HTML exploited SGML's allowing some elements to omit start- or end-tags. Subsequent versions of HTML added new tags and attributes to it and made obsolete some existing ones. Until HTML 5, changes to HTML were made so that it always remained a child language of SGML.

After XML was standardized, an instance of it called XHTML came out, combining the existing HTML tag names with XML's rigor on tag closing, namespaces, schemas, etc. XHTML initially held the promise of being useful for storage, transfer and display of data. It seemed about to replace HTML as the most common way to display web material, until HTML 5 came out. HTML 5 had some syntactic features that went beyond those defined in SGML so as to provide a richer data display, especially for multimedia-laden websites. As time went on, further features were added to HTML 5 that enriched its use for data display to the point that it is unlikely ever to be superseded by new XHTML versions, at least as far as display of data is concerned.

Although standards for HTML and XHTML are set by W3C working groups, actual propagation of these languages "on the ground" is done by progressive web designers, and there are none more progressive than those working in the media (advertising/PR/marketing) sector: just look at the creativity of advertising agency sites compared to other sites. This sector really took to the new HTML 5 language and delighted in exploiting its capacity for SVG, audio, video and the new APIs. Their ready adoption of HTML 5 led quickly to its popularity among web designers in general, a process accelerated by the online exchange of skills and tricks on YouTube and various other sites.

An updated XHTML version, XHTML5, has emerged, but it is not really a strict XML derivative but rather an XML serialization of HTML5. Only a small proportion of sites appear to have any use for it.
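
To show what an XML serialization of HTML means in practice, here is a small sketch with Python's standard library. The page is a made-up minimal document written to satisfy both syntaxes, the helper names are my own, and `html.parser` is only a simple tokenizer standing in for a real HTML parser. The same text can be read by an XML parser and by an HTML parser:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical page: every tag is closed and the XHTML namespace is declared.
PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Demo</title></head>"
    "<body><p>Hello<br/>world</p></body>"
    "</html>"
)


class TitleFinder(HTMLParser):
    """Collects the text inside <title> using the HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def title_via_xml(text):
    """Read the title with an XML parser (namespace-aware lookup)."""
    root = ET.fromstring(text)
    return root.findtext("x:head/x:title", namespaces={"x": XHTML_NS})


def title_via_html(text):
    """Read the title with the HTML tokenizer."""
    finder = TitleFinder()
    finder.feed(text)
    finder.close()
    return finder.title
```

Both functions return the same title for `PAGE`. The XML route needs the namespace spelled out, which is one of the places where the two views of the same document differ.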

That's the story behind these data languages. I hope it helps you distinguish the meaning and purpose of them all. Philosophically, this story shows how an essential enabling tool (SGML) for a new technology (the internet) can, in a new environment with increasingly varied demands, outgrow its original limits yet become conceptually simpler, more versatile in application and more powerful in impact.

1

Generally in the standards world, a "profile" of a standard is a selection of options that the standard offers: for example, if the standard allows documents to be encoded in UTF-8 or UTF-16, a profile of the standard might require them to be encoded in UTF-8. The term "subset" has a very similar meaning; though arguably the term "profile" is a little bit wider.
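
A toy model of that idea in Python, purely for illustration: the "standard" and the "profile" below are hypothetical and exist only in this example. The profile is simply a narrower choice among the options the standard already permits:

```python
import codecs

# Hypothetical standard: documents may use either encoding.
STANDARD_ENCODINGS = {"utf-8", "utf-16"}

# Hypothetical profile of that standard: only one option is kept.
PROFILE_ENCODINGS = {"utf-8"}


def _canonical(encoding):
    """Normalize spellings such as 'UTF8' to Python's canonical codec name."""
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        return None  # unknown encoding names conform to nothing


def allowed_by_standard(encoding):
    return _canonical(encoding) in STANDARD_ENCODINGS


def allowed_by_profile(encoding):
    return _canonical(encoding) in PROFILE_ENCODINGS
```

Anything the profile accepts is also accepted by the standard, but not the other way round: `allowed_by_standard("utf-16")` is true while `allowed_by_profile("utf-16")` is false.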

1
  • Thanks! (1) How about the meaning of and difference between "application", "subset" and "profile", as in Part 5 of my questions? (2) In "XHTML is the basis for a family of future document types that extend and subset HTML", does it mean XHTML is a subset of HTML or HTML is a subset of XHTML?
    – Tim
    Commented Jul 16, 2011 at 12:19
