Re: Use cases from Kurt Cagle on 2011-01-04 (public-html-xml@w3.org from January 2011)

From: Kurt Cagle <kurt.cagle@gmail.com>
Date: Tue, 4 Jan 2011 15:05:30 -0500
To: John Cowan <cowan@mercury.ccil.org>
Cc: Henri Sivonen <hsivonen@iki.fi>, public-html-xml@w3.org
Message-ID: <AANLkTi=GTc-zGENfKbrBEWZALhvUcaZgGojrGwxOnZUp@mail.gmail.com>
>
> > Simply because a couple of browser vendors thinks the world revolves around
> > HTML doesn't mean that the rest of the world does.
>
> Simply because a few thousand XML weenies think the world revolves around
> XML doesn't mean that the Real World contains, to a first approximation,
> anything but HTML.  :-)


Hey - always useful to define one's audience :-)

Seriously, though, I think this is probably dirty laundry that needs to be
aired. There are several tens of billions of HTML documents out there now,
mostly HTML 4.0. There are probably a similar number of XML documents out
there (of which a fairly small percentage is web content), and in general
XML still outweighs JSON in data transfer usage by a considerable amount,
especially when serializing databases for wire transfer. At this point only
a very small fraction of the HTML out there (maybe 0.01 percent) is HTML5,
and most of that is in the use of <video> and <audio> tags. HTML5 is not
HTML4, though it is obviously backwards compatible. While it is sometimes
convenient to use the weight of existing HTML in these arguments, the
reality is that HTML5 is still an ongoing work in progress, subject to
change.

From the standpoint of the server, HTML5 vs. XML processing is largely a
non-issue, because the processing code involved is usually generative - with
the exception of a small handful of tools such as TagSoup, there are
comparatively few server processes that actually consume HTML, and even when
they do, it's generally to store as text. Indeed, I suspect that the vast
bulk of all HTML content that is produced anymore likely originates not as
HTML at all, but as BBCode or Wiki content, then converted via a
transformative process to HTML or XHTML as appropriate. XML on the other
hand, is consumed and produced in equal measure, which is why the integrity
of the incoming content is usually of greater concern.

In a way, I think this is one of the major axes of the current debate. The
primary consumers of HTML are the web browsers and similar clients. From the
perspective of most of these clients, they couldn't care less about what
happens on the server side - I know, having done a stint for several years
working in the browser space - so long as what comes at them is parseable as
HTML. The primary consumers of XML are largely server side or application
developers, though there are also data bridges of various sorts as well as
frameworks. Most of the major frameworks that are used to build web pages,
from Drupal and WordPress to SharePoint and other professional grade
systems, actually create well-formed XML content that is then served up as
text/html, primarily because Microsoft has only JUST begun to recognize
application/xhtml+xml as a legitimate MIME type. This means that in general
the HTML content from the server that isn't well-formed is produced by
people who are simply ignorant of XML conventions rather than by people
who are deliberately following the HTML conventions.

One question that should be asked is how much of the "ill-formed" (from the
XML perspective) content comes from developers coding websites mostly by hand
(perhaps with a JSP or similar substitution layer handling individual text
substitution) and how much comes from web application frameworks?

If the former dominates (and will continue to dominate), then I think that
the argument of HTML5 as a language distinct from XML makes sense. If the
latter dominates (as I suspect it does), then the argument is weak at best -
software will need to be rewritten to support HTML5 anyway, and at that
stage the changes necessary to move from ill-formed XML to well-formed XML
(even one without namespaces, which has NEVER been a requirement to create
XML) are comparatively trivial to make. HTML5 parsers can continue to handle
the weakly validating case without processing the content as XML during a
transitional period, but at the same time the HTML5 recommendation can
*recommend* the use of XML well-formedness moving forward, which means that
future development can move towards a common basis. At the same time, a
Micro-XML can move forward that describes the most simplified subset
possible of XML (one not using namespaces but that is still well-formed) to
make such XML easier to work with in the HTML environment where it might be
an issue.

Kurt Cagle
Received on Tuesday, 4 January 2011 20:10:08 UTC