Re: HTML and XML from noah_mendelsohn@us.ibm.com on 2009-03-05 (www-tag@w3.org from March 2009)

From: <noah_mendelsohn@us.ibm.com>
Date: Thu, 5 Mar 2009 14:46:11 -0500
To: Anne van Kesteren <annevk@opera.com>, elharo@metalab.unc.edu, Henri Sivonen <hsivonen@iki.fi>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, Julian Reschke <julian.reschke@gmx.de>, "Michael(tm) Smith" <mike@w3.org>, David Orchard <orchard@pacificspirit.com>, www-tag@w3.org
Message-ID: <OF88175321.AAB12F5D-ON85257570.006C1768-85257570.006C9A89@lotus.com>

I wrote this email a few weeks ago, but it's just been referenced again in 
a TAG F2F discussion, minutes of which will likely come out within a week 
or so.  That caused me to reread it, and to notice that there are a number 
of typos.  Most of these I won't bother to correct, but one is so 
embarrassing that I'm moved to point it out:

I wrote:

> I don't think that the documentation for any one of those 
> failure or recovery strategies should be inexplicably bound to 
> the specification for the language syntax and its interpretation. 

Well, both are true I suppose, but I hope it's obvious that I really 
meant:

"I don't think that the documentation for any one of those failure or 
recovery strategies should be >inextricably< bound to the specification 
for the language syntax and its interpretation."

Noah

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Noah Mendelsohn
02/18/2009 11:38 PM

        To:     elharo@metalab.unc.edu
        cc:     Anne van Kesteren <annevk@opera.com>, Henri Sivonen 
<hsivonen@iki.fi>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, Julian Reschke 
<julian.reschke@gmx.de>, "Michael(tm) Smith" <mike@w3.org>, David Orchard 
<orchard@pacificspirit.com>, www-tag@w3.org, www-tag-request@w3.org
        Subject:        Re: HTML and XML


Elliotte Harold wrote:

> I'm not aware of any current specs that attempt to prescribe 
> the handling of a byte stream received over HTTP,

Well, I think it's clear that there are normative specifications that 
define the correct >>interpretation<< of a {media-type; octet-stream} pair 
received over HTTP (see [1]).  I agree that HTTP and associated 
specifications do not typically "prescribe the handling" of such streams. 
Indeed, my point in this note is to discuss the distinction between a 
specification for "correct interpretation" and one for "prescribed 
handling".  As it happens, I see that distinction, and the disagreements 
some of us have about it, as fundamental to the difficulties we all have 
coming to easy agreement on how best to deal with error handling in 
specifications like HTML and XML.

By "correct interpretation" of the pair I mean that the specifications 
tell you what you can conclude.  For example, if I serve:

        Content-type: application/xml
        Entity-body:  <a><b/></a>

the specifications allow me to conclude that two elements have been 
transmitted, one named 'a', the other 'b', with the latter nested in the 
former.  What RFC 2616, RFC 3023, and the XML Recommendation do not tell 
me is the "prescribed handling".  For example, do I show the elements 
on the screen, should I apply CSS to them, store them in a database, or 
even perhaps decide that my application is going to throw an 
application-level error for a root element of 'a', even though it's 
perfectly legal XML?

Now, if I receive the same entity body with a different media type:

        Content-type: application/octet-stream
        Entity-body:  <a><b/></a>

I cannot conclude anything about elements.  The resemblance to XML or 
even Unicode characters may be coincidental (if unlikely).  All I can 
conclude is that I've received a sequence of bits, with some suggestion 
that they be treated in groups of 8.  Again, nothing in the pertinent 
specifications tells me what the prescribed handling is.  A browser user 
agent retrieving this pair may have some conventions, perhaps to offer to 
save a file, but another user agent might quite reasonably do something 
else or declare an application-level error.

A third case:

        Content-type: application/xml
        Entity-body:  <a></b>

Here we can conclude that the data received is not legal per the 
applicable specifications.  What to do about that, though, is not (I 
think) specified by the XML Recommendation, which is referred to by RFC 
3023, which is referred to indirectly by the HTTP specification (RFC 
2616).  So, >prescribed handling< is again not given;  just the conclusion 
that the data is not legal per the specs.  Since the data is not legal, 
anything a user agent might do to help you recover, such as pointing out 
where the tags don't match, is beyond >this layer< of the specifications. 
It's sort of like the C Language Reference and the specification you would 
write for Lint.  Both are useful specifications, but it's a good thing 
that they are separate.  You can imagine lots of lint-like tools, with 
different behavior, that would help different communities of C users deal 
with various potential problems in their (purported) C code.
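
The three cases above can be sketched in a few lines of Python.  This is 
only an illustration, assuming the standard library's 
xml.etree.ElementTree as the XML parser; the function name interpret() and 
its return labels are hypothetical and come from no specification.  The 
well-formed body used here is <a><b/></a>, i.e. 'b' nested in 'a'.  The 
function reports only what may be concluded from the pair; it takes no 
position on what a receiver should then do.

```python
import xml.etree.ElementTree as ET


def interpret(media_type, body):
    """Return what can be concluded from a (media type, octets) pair.

    Hypothetical helper: it describes interpretation only, never handling.
    """
    if media_type != "application/xml":
        # e.g. application/octet-stream: all we know is a run of octets,
        # however much they may happen to look like markup.
        return ("octets", len(body))
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # The data is not legal XML; what to do next is a separate question.
        return ("not-well-formed", None)
    # Legal XML: we may conclude which elements were transmitted.
    return ("elements", [element.tag for element in root.iter()])


print(interpret("application/xml", b"<a><b/></a>"))
# ('elements', ['a', 'b'])
print(interpret("application/octet-stream", b"<a><b/></a>"))
# ('octets', 11)
print(interpret("application/xml", b"<a></b>"))
# ('not-well-formed', None)
```

Note that the same eleven octets yield different conclusions depending 
solely on the media type, and that none of the three results says 
anything about display, storage, or recovery.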

The same is true for XML, I think.  Your data is either legal XML or it 
isn't; that's not a statement about processing, it's just a fact.  I 
choose to think that what I want to do about illegal XML depends on the 
circumstance.  For mission critical applications of XML as a data format, 
I surely want to decline to process the data I've received, but I might 
want to run some tools that help me isolate the errors.  For less critical 
applications I might want to do what XML5 advocates seem to favor, i.e. 
fix up the input as best I can and proceed.  I don't think that the 
documentation for any one of those failure or recovery strategies should 
be inexplicably bound to the specification for the language syntax and its 
interpretation.  Indeed, I think the XML Recommendation goes just a bit 
too far.  The language spec should say:  "here's what's legal XML, and 
here's what you can extract from legal XML".  Full stop.  Specifications 
for pieces of software that deal with data purported to be XML are also 
important, but should be separate, IMO.  So, XML5 may be useful as a 
specification for data that some applications may want to process, but 
XML5 should then not be seen as a replacement for XML itself.  It should 
be seen as a superset to be used with care in places (if any) where it's 
perceived to be a net win.  Whether the community is on balance well 
served by having such an XML5 specification, I'm unconvinced, but there 
are good arguments on both sides I think.
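
As a rough sketch of that separation, the Python below keeps one 
"language layer" function that only answers whether the input is 
well-formed, and puts two different handling policies on top of it.  All 
names (check, strict_policy, diagnostic_policy) are hypothetical, and the 
standard library's xml.etree.ElementTree stands in for the XML 
Recommendation's notion of well-formedness.  An XML5-style fix-up policy 
is deliberately omitted, since the standard library offers no such 
recovering parser.

```python
import xml.etree.ElementTree as ET


def check(body):
    """Language layer: return None if well-formed, else (line, column)."""
    try:
        ET.fromstring(body)
    except ET.ParseError as err:
        return err.position  # where the parser gave up
    return None


def strict_policy(body):
    """Handling for mission-critical use: decline ill-formed input."""
    if check(body) is not None:
        raise ValueError("rejected: input is not well-formed XML")
    return "processed"


def diagnostic_policy(body):
    """Lint-like handling: report where the problem is, process nothing."""
    position = check(body)
    if position is None:
        return "well-formed"
    return "error at line %d, column %d" % position


print(diagnostic_policy(b"<a></b>"))
print(strict_policy(b"<a><b/></a>"))
```

Either policy can be replaced or extended without touching check(), 
which is the point: the definition of what is legal does not need to 
know which recovery strategy, if any, a given application prefers.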

Anyway, I've gone into some detail and probably run on too long, but I'm 
really only trying to make one point:  the specification of correct 
interpretation is not the same as the specification for prescribed 
handling.  I believe that HTTP and the specifications to which it 
delegates do mostly the former in discussing Content-type and Entity-body. 
HTML 5 does both.  As I've stated before, I would prefer if those two 
sides of the HTML 5 specification were packaged separately, to the extent 
practical.  Roughly that would be:  one document describing legal HTML 5 
and its correct interpretation (in the sense above); the other would be a 
specification for what we might call a "full function browser", and that 
would be where the fixups for the error cases would be documented.  I do 
acknowledge that the tight integration of scripting into the browsers' HTML 
handling greatly complicates this story.  I'm not yet convinced that 
something like XML5 will on balance be beneficial, but perhaps it would 
bring value for certain less critical applications of XML.

Noah

[1] http://www.w3.org/2001/tag/doc/selfDescribingDocuments.html#grounding

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Thursday, 5 March 2009 19:46:56 UTC