- From: <noah_mendelsohn@us.ibm.com>
- Date: Thu, 5 Mar 2009 14:46:11 -0500
- To: Anne van Kesteren <annevk@opera.com>, elharo@metalab.unc.edu, Henri Sivonen <hsivonen@iki.fi>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, Julian Reschke <julian.reschke@gmx.de>, "Michael(tm) Smith" <mike@w3.org>, David Orchard <orchard@pacificspirit.com>, www-tag@w3.org
I wrote this email a few weeks ago, but it's just been referenced again in a TAG F2F discussion, minutes of which will likely come out within a week or so. That caused me to reread it, and to notice that there are a number of typos. Most of these I won't bother to correct, but one is so embarrassing that I'm moved to point it out. I wrote:

> I don't think that the documentation for any one of those
> failure or recovery strategies should be inexplicably bound to
> the specification for the language syntax and its interpretation.

Well, both are true I suppose, but I hope it's obvious that I really meant:

"I don't think that the documentation for any one of those failure or recovery strategies should be >inextricably< bound to the specification for the language syntax and its interpretation."

Noah

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Noah Mendelsohn
02/18/2009 11:38 PM

To: elharo@metalab.unc.edu
cc: Anne van Kesteren <annevk@opera.com>, Henri Sivonen <hsivonen@iki.fi>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, Julian Reschke <julian.reschke@gmx.de>, "Michael(tm) Smith" <mike@w3.org>, David Orchard <orchard@pacificspirit.com>, www-tag@w3.org, www-tag-request@w3.org
Subject: Re: HTML and XML

Elliotte Harold wrote:

> I'm not aware of any current specs that attempt to prescribe
> the handling of a byte stream received over HTTP,

Well, I think it's clear that there are normative specifications that define the correct >>interpretation<< of a {media-type; octet-stream} pair received over HTTP (see [1]). I agree that HTTP and associated specifications do not typically "prescribe the handling" of such streams. Indeed, my point in this note is to discuss the distinction between a specification for "correct interpretation" and one for "prescribed handling".
As it happens, I see that distinction, and the disagreements some of us have about it, as fundamental to the difficulties we all have coming to easy agreement on how best to deal with error handling in specifications like HTML and XML.

By "correct interpretation" of the pair I mean that the specifications tell you what you can conclude. For example, if I serve:

Content-type: application/xml
Entity-body: <a><b/></a>

the specifications allow me to conclude that two elements have been transmitted, one named 'a', the other 'b', with the latter nested in the former. What RFC 2616, RFC 3023, and the XML Recommendation do not tell me is the "prescribed handling". For example, should I show the elements on the screen, apply CSS to them, store them in a database, or even perhaps decide that my application is going to throw an application-level error for a root element of 'a', even though it's perfectly legal XML?

Now, if I receive the same entity body with a different media type:

Content-type: application/octet-stream
Entity-body: <a><b/></a>

I cannot conclude anything about elements. The resemblance to HTML or even Unicode characters may be coincidental (if unlikely). All I can conclude is that I've received a sequence of bits, with some suggestion that they be treated in groups of 8. Again, nothing in the pertinent specifications tells me what the prescribed handling is. A browser user agent retrieving this pair may have some conventions, perhaps to offer to save a file, but another user agent might quite reasonably do something else or declare an application-level error.

A third case:

Content-type: application/xml
Entity-body: <a></b>

Here we can conclude that the data received is not legal per the applicable specifications. What to do about that, though, is not (I think) specified by the XML Recommendation, which is referred to by RFC 3023, which is referred to indirectly by the HTTP specification (RFC 2616).
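
For readers who find code easier to follow than prose, the three cases can be sketched roughly as below. This is only an illustration in Python, not anything drawn from the RFCs or the XML Recommendation: the names Interpretation and interpret are hypothetical, media type parameters such as charset are ignored, and every type other than application/xml is treated as opaque octets simply because this sketch knows no other syntax. It uses a well-formed document with 'b' nested in 'a' for the first two cases. The function reports only what can be concluded; it deliberately takes no position on what a recipient should then do.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interpretation:
    """What a recipient may conclude from a {media-type; octets} pair.

    Hypothetical type for illustration; it records conclusions only and
    carries no instruction about display, storage, or error recovery.
    """
    media_type: str
    legal: Optional[bool]     # None means no syntax applies (opaque octets)
    element_names: List[str]  # document order; empty unless legal XML


def interpret(media_type: str, body: bytes) -> Interpretation:
    """Return the conclusions available for the pair, and nothing more."""
    if media_type == "application/xml":
        try:
            root = ET.fromstring(body)
        except ET.ParseError:
            # Third case: the data is not legal XML. That is a fact about
            # the data, not a decision about what to do next.
            return Interpretation(media_type, False, [])
        # First case: legal XML, so its elements can be named.
        return Interpretation(media_type, True, [el.tag for el in root.iter()])
    # Second case (e.g. application/octet-stream): only octets were received,
    # however much they may resemble markup.
    return Interpretation(media_type, None, [])


if __name__ == "__main__":
    for media_type, body in [
        ("application/xml", b"<a><b/></a>"),
        ("application/octet-stream", b"<a><b/></a>"),
        ("application/xml", b"<a></b>"),
    ]:
        print(interpret(media_type, body))
```

Whatever a user agent, a validator, or a database loader does with the returned value would live in separate code, which is the layering being argued for here.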
So, >prescribed handling< is again not given; just the conclusion that the data is not legal per the specs. Since the data is not legal, anything a user agent might do to help you recover, such as pointing out where the tags don't match, is beyond >this layer< of the specifications.

It's sort of like the C Language Reference and the specification you would write for Lint. Both are useful specifications, but it's a good thing that they are separate. You can imagine lots of lint-like tools, with different behavior, that would help different communities of C users deal with various potential problems in their (purported) C code.

The same is true for XML, I think. Your data is either legal XML or it isn't; that's not a statement about processing, it's just a fact. I choose to think that what I want to do about illegal XML depends on the circumstance. For mission-critical applications of XML as a data format, I surely want to decline to process the data I've received, but I might want to run some tools that help me isolate the errors. For less critical applications I might want to do what XML5 advocates seem to favor, i.e. fix up the input as best I can and proceed. I don't think that the documentation for any one of those failure or recovery strategies should be inexplicably bound to the specification for the language syntax and its interpretation.

Indeed, I think the XML Recommendation goes just a bit too far. The language spec should say: "here's what's legal XML, and here's what you can extract from legal XML". Full stop. Specifications for pieces of software that deal with data purported to be XML are also important, but should be separate, IMO.

So, XML5 may be useful as a specification for data that some applications may want to process, but XML5 should then not be seen as a replacement for XML itself. It should be seen as a superset to be used with care in places (if any) where it's perceived to be a net win.
Whether the community is on balance well served by having such an XML5 specification, I'm unconvinced, but there are good arguments on both sides I think.

Anyway, I've gone into some detail and probably run on too long, but I'm really only trying to make one point: the specification of correct interpretation is not the same as the specification for prescribed handling. I believe that HTTP and the specifications to which it delegates do mostly the former in discussing Content-type and Entity-body. HTML 5 does both.

As I've stated before, I would prefer if those two sides of the HTML 5 specification were packaged separately, to the extent practical. Roughly that would be: one document describing legal HTML 5 and its correct interpretation (in the sense above); the other would be a specification for what we might call a "full function browser", and that would be where the fixups for the error cases would be documented. I do acknowledge that the tight integration of scripting into the browser's HTML handling greatly complicates this story.

I'm not yet convinced that something like XML5 will on balance be beneficial, but perhaps it would bring value for certain less critical applications of XML.

Noah

[1] http://www.w3.org/2001/tag/doc/selfDescribingDocuments.html#grounding

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Thursday, 5 March 2009 19:46:56 UTC