- From: Kurt Cagle <kurt.cagle@gmail.com>
- Date: Tue, 4 Jan 2011 15:05:30 -0500
- To: John Cowan <cowan@mercury.ccil.org>
- Cc: Henri Sivonen <hsivonen@iki.fi>, public-html-xml@w3.org
- Message-ID: <AANLkTi=GTc-zGENfKbrBEWZALhvUcaZgGojrGwxOnZUp@mail.gmail.com>
> > Simply because a couple of browser vendors thinks the world revolves
> > around HTML doesn't mean that the rest of the world does.
>
> Simply because a few thousand XML weenies think the world revolves around
> XML doesn't mean that the Real World contains, to a first approximation,
> anything but HTML. :-)

Hey - always useful to define one's audience :-)

Seriously, though, I think this is probably dirty laundry that needs to be aired. There are several tens of billions of HTML documents out there now, mostly HTML 4.0. There are probably a similar number of XML documents out there (of which a fairly small percentage is web content), and in general XML still outweighs JSON in data transfer usage by a considerable amount, especially when serializing databases for wire transfer.

At this point only a very small fraction of the HTML out there (maybe 0.01 percent) is HTML5, and most of that is in the use of <video> and <audio> tags. HTML5 is not HTML4, though it is obviously backwards compatible. While it is sometimes convenient to use the weight of existing HTML in these arguments, the reality is that HTML5 is still an ongoing work in progress, subject to change.

From the standpoint of the server, HTML5 vs. XML processing is largely a non-issue, because the processing code involved is usually generative - with the exception of a small handful of tools such as Tag Soup, there are comparatively few server processes that actually consume HTML, and even when they do, it's generally to store it as text. Indeed, I suspect that the vast bulk of all HTML content that is produced anymore likely originates not as HTML at all, but as BBCode or Wiki content, which is then converted via a transformative process to HTML or XHTML as appropriate. XML, on the other hand, is consumed and produced in equal measure, which is why the integrity of the incoming content is usually of greater concern.

In a way, I think this is one of the major axes of the current debate.
The primary consumers of HTML are the web browsers and similar clients. From the perspective of most of these clients, they couldn't care less about what happens on the server side - I know, having done a stint for several years working in the browser space - so long as what comes at them is parseable as HTML. The primary consumers of XML are largely server-side or application developers, though there are also data bridges of various sorts as well as frameworks.

Most of the major frameworks that are used to build web pages, from Drupal and WordPress to SharePoint and other professional-grade systems, actually create well-formed XML content that is then served up as text/html, primarily because Microsoft has only JUST begun to recognize application/xhtml+xml as a legitimate MIME type. This means that in general the HTML content that isn't well formed that comes from the server is produced by people who are simply ignorant of XML conventions rather than by people who are deliberately following the HTML conventions.

One question that should be asked is how much of the "ill-formed" (from the XML perspective) content comes from developers coding websites mostly by hand (perhaps with a JSP or similar substitution layer handling individual text substitution) and how much comes from web application frameworks? If the former dominates (and will continue to dominate), then I think that the argument for HTML5 as a language distinct from XML makes sense. If the latter dominates (as I suspect it does), then the argument is weak at best - software will need to be rewritten to support HTML5 anyway, and at that stage the changes necessary to move from ill-formed XML to well-formed XML (even XML without namespaces, which have NEVER been a requirement to create XML) are comparatively trivial to make.
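
To make the "ill-formed from the XML perspective" distinction concrete, here is a minimal sketch in Python. The two sample fragments and the helper names (is_well_formed, TagCollector, lenient_tags) are made up for illustration, and Python's html.parser is only a lenient stand-in, not an HTML5 parser. The point is just that the same content can be acceptable to a forgiving HTML consumer while being rejected by an XML one, and that the well-formed variant needs no namespaces.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML; namespaces are not required."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient stand-in for a browser-style consumer: records start tags only."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_tags(markup: str) -> list[str]:
    """Feed markup to the lenient parser and return the start tags it saw."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# Hypothetical sample fragments, not taken from any real site.
hand_coded = "<p>First line<br>Second line</p>"         # <br> is never closed
framework_output = "<p>First line<br/>Second line</p>"  # same content, well-formed

for sample in (hand_coded, framework_output):
    print(is_well_formed(sample), lenient_tags(sample))
```

The lenient parser reports the same tags for both fragments, while the XML parser rejects only the first. Under these assumptions the only difference between the two is the self-closing <br/>, which is the scale of change I have in mind.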

HTML5 parsers can continue to handle the weakly validating case without processing the content as XML during a transitional period, but at the same time the HTML5 recommendation can *recommend* the use of XML well-formedness moving forward, which means that future development can move towards a common basis. At the same time, a Micro-XML can move forward that describes the most simplified subset possible of XML (one not using namespaces but that is still well-formed) to make such XML easier to work with in the HTML environment where it might be an issue.

Kurt Cagle
Received on Tuesday, 4 January 2011 20:10:08 UTC