Re: HTML and XML

Elliotte Harold wrote:

> I'm not aware of any current specs that attempt to prescribe 
> the handling of a byte stream received over HTTP,

Well, I think it's clear that there are normative specifications that 
define the correct >>interpretation<< of a {media-type; octet-stream} pair 
received over HTTP (see [1]).  I agree that HTTP and associated 
specifications do not typically "prescribe the handling" of such streams. 
Indeed, my point in this note is to discuss the distinction between a 
specification for "correct interpretation" and one for "prescribed 
handling".  As it happens, I see that distinction, and the disagreements 
some of us have about it, as fundamental to the difficulties we all have 
coming to easy agreement on how best to deal with error handling in 
specifications like HTML and XML.

By "correct interpretation" of the pair I mean that the specifications 
tell you what you can conclude.  For example, if I serve:

        Content-type: application/xml
        Entity-body:  <a><b/></a>

the specifications allow me to conclude that two elements have been 
transmitted, one named 'a', the other 'b', with the latter nested in the 
former.  What RFC 2616, RFC 3023, and the XML Recommendation do not tell 
me is the "prescribed handling".  For example, do I show the elements on 
the screen, apply CSS to them, store them in a database, or even perhaps 
decide that my application is going to throw an application-level error 
for a root element of 'a', even though it's perfectly legal XML?

Now, if I receive the same entity body with a different media type:

        Content-type: application/octet-stream
        Entity-body:  <a><b/></a>

I cannot conclude anything about elements.  The resemblance to HTML or 
even Unicode characters may be coincidental (if unlikely).  All I can 
conclude is that I've received a sequence of bits, with some suggestion 
that they be treated in groups of 8.  Again, nothing in the pertinent 
specifications tells me what the prescribed handling is.  A browser user 
agent retrieving this pair may have some conventions, perhaps to offer to 
save a file, but another user agent might quite reasonably do something 
else or declare an application-level error.

A third case:

        Content-type: application/xml
        Entity-body:  <a></b>

Here we can conclude that the data received is not legal per the 
applicable specifications.  What to do about that, though, is not (I 
think) specified by the XML Recommendation, which is referred to by RFC 
3023, which is referred to indirectly by the HTTP specification (RFC 
2616).  So, >prescribed handling< is again not given;  just the conclusion 
that the data is not legal per the specs.  Since the data is not legal, 
anything a user agent might do to help you recover, such as pointing out 
where the tags don't match, is beyond >this layer< of the specifications. 
It's sort of like the C Language Reference and the specification you would 
write for Lint.  Both are useful specifications, but it's a good thing 
that they are separate.  You can imagine lots of lint-like tools, with 
different behavior, that would help different communities of C users deal 
with various potential problems in their (purported) C code.
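
A minimal Python sketch of the three cases above may make the 
"interpretation" layer concrete.  The function name interpret and the 
result labels are hypothetical, chosen only for illustration; the media 
type is compared as a bare string, ignoring parameters such as charset.  
The function reports what can be concluded and deliberately does nothing 
about handling:

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def interpret(media_type: str, body: bytes) -> Tuple[str, object]:
    """Say what may be concluded from a {media-type; octet-stream} pair.

    Hypothetical helper: it only draws conclusions and never decides
    what an application should do with them.
    """
    if media_type == "application/xml":
        try:
            root = ET.fromstring(body)
        except ET.ParseError as err:
            # Conclusion: the body is not well-formed XML. Nothing more.
            return ("not-well-formed", str(err))
        # Conclusion: these elements were transmitted, in document order.
        return ("xml", [el.tag for el in root.iter()])
    # Any other media type in this sketch: only a sequence of octets.
    return ("octets", len(body))


case1 = interpret("application/xml", b"<a><b/></a>")
case2 = interpret("application/octet-stream", b"<a><b/></a>")
case3 = interpret("application/xml", b"<a></b>")
```

Here case1 reports the elements 'a' and 'b', case2 reports only a byte 
count, and case3 reports that the body is not well-formed.  Whether to 
display, store, or reject any of these is left to the caller.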

The same is true for XML, I think.  Your data is either legal XML or it 
isn't; that's not a statement about processing, it's just a fact.  I 
choose to think that what I want to do about illegal XML depends on the 
circumstance.  For mission-critical applications of XML as a data format, 
I surely want to decline to process the data I've received, but I might 
want to run some tools that help me isolate the errors.  For less critical 
applications I might want to do what XML5 advocates seem to favor, i.e., 
fix up the input as best I can and proceed.  I don't think that the 
documentation for any one of those failure or recovery strategies should 
be inextricably bound to the specification for the language syntax and its 
interpretation.  Indeed, I think the XML Recommendation goes just a bit 
too far.  The language spec should say:  "here's what's legal XML, and 
here's what you can extract from legal XML".  Full stop.  Specifications 
for pieces of software that deal with data purported to be XML are also 
important, but should be separate, IMO.  So, XML5 may be useful as a 
specification for data that some applications may want to process, but 
XML5 should then not be seen as a replacement for XML itself.  It should 
be seen as a superset to be used with care in places (if any) where it's 
perceived to be a net win.  I'm unconvinced that the community is on 
balance well served by having such an XML5 specification, but there are 
good arguments on both sides, I think.
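
As a rough illustration of keeping recovery strategies separate from the 
language definition, here is a Python sketch in which two policies sit on 
top of the same parser.  The names strict_policy and diagnostic_policy 
are hypothetical, and neither attempts XML5-style fixups; the point is 
only that each policy is an application choice layered above one shared 
definition of what is well-formed:

```python
import xml.etree.ElementTree as ET


def strict_policy(body: bytes) -> ET.Element:
    """Hypothetical mission-critical policy: decline anything malformed."""
    try:
        return ET.fromstring(body)
    except ET.ParseError as err:
        raise ValueError("declining to process: not well-formed") from err


def diagnostic_policy(body: bytes) -> str:
    """Hypothetical lint-like policy: report where the parse failed."""
    try:
        ET.fromstring(body)
    except ET.ParseError as err:
        line, column = err.position  # location reported by the parser
        return f"not well-formed at line {line}, column {column}"
    return "well-formed"


report_ok = diagnostic_policy(b"<a><b/></a>")
report_bad = diagnostic_policy(b"<a></b>")
```

Both policies agree on which inputs are well-formed; they differ only in 
what they do when an input is not.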

Anyway, I've gone into some detail and probably run on too long, but I'm 
really only trying to make one point:  the specification of correct 
interpretation is not the same as the specification for prescribed 
handling.  I believe that HTTP and the specifications to which it 
delegates do mostly the former in discussing Content-type and Entity-body. 
HTML 5 does both.  As I've stated before, I would prefer if those two 
sides of the HTML 5 specification were packaged separately, to the extent 
practical.  Roughly that would be:  one document describing legal HTML 5 
and its correct interpretation (in the sense above); the other would be a 
specification for what we might call a "full function browser", and that 
would be where the fixups for the error cases would be documented.  I do 
acknowledge that the tight integration of scripting into the browser's 
HTML handling greatly complicates this story.  I'm not yet convinced that 
something like XML5 will on balance be beneficial, but perhaps it would 
bring value for certain less critical applications of XML.

Noah

[1] http://www.w3.org/2001/tag/doc/selfDescribingDocuments.html#grounding

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Received on Thursday, 19 February 2009 04:39:30 UTC