- From: <noah_mendelsohn@us.ibm.com>
- Date: Thu, 5 Mar 2009 14:46:11 -0500
- To: Anne van Kesteren <annevk@opera.com>, elharo@metalab.unc.edu, Henri Sivonen <hsivonen@iki.fi>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, Julian Reschke <julian.reschke@gmx.de>, "Michael(tm) Smith" <mike@w3.org>, David Orchard <orchard@pacificspirit.com>, www-tag@w3.org
I wrote this email a few weeks ago, but it's just been referenced again in
a TAG F2F discussion, minutes of which will likely come out within a week
or so. That caused me to reread it, and to notice that there are a number
of typos. Most of these I won't bother to correct, but one is so
embarrassing that I'm moved to point it out:
I wrote:
> I don't think that the documentation for any one of those
> failure or recovery strategies should be inexplicably bound to
> the specification for the language syntax and its interpretation.
Well, both are true I suppose, but I hope it's obvious that I really
meant:
"I don't think that the documentation for any one of those failure or
recovery strategies should be >inextricably< bound to the specification
for the language syntax and its interpretation."
Noah
--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Noah Mendelsohn
02/18/2009 11:38 PM
To: elharo@metalab.unc.edu
cc: Anne van Kesteren <annevk@opera.com>, Henri Sivonen
<hsivonen@iki.fi>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, Julian Reschke
<julian.reschke@gmx.de>, "Michael(tm) Smith" <mike@w3.org>, David Orchard
<orchard@pacificspirit.com>, www-tag@w3.org, www-tag-request@w3.org
Subject: Re: HTML and XML
Elliotte Harold wrote:
> I'm not aware of any current specs that attempt to prescribe
> the handling of a byte stream received over HTTP,
Well, I think it's clear that there are normative specifications that
define the correct >>interpretation<< of a {media-type; octet-stream} pair
received over HTTP (see [1]). I agree that HTTP and associated
specifications do not typically "prescribe the handling" of such streams.
Indeed, my point in this note is to discuss the distinction between a
specification for "correct interpretation" and one for "prescribed
handling". As it happens, I see that distinction, and the disagreements
some of us have about it, as fundamental to the difficulties we all have
coming to easy agreement on how best to deal with error handling in
specifications like HTML and XML.
By "correct interpretation" of the pair I mean that the specifications
tell you what you can conclude. For example, if I serve:
Content-type: application/xml
Entity-body: <a><b/></a>
the specifications allow me to conclude that two elements have been
transmitted, one named 'a', the other 'b', with the latter nested in the
former. What RFC 2616, RFC 3023, and the XML Recommendation do not tell
me is the "prescribed handling". For example, do I show the elements on
the screen, should I apply CSS to them, store them in a database, or
even perhaps decide that my application is going to throw an
application-level error for a root element of 'a', even though it's
perfectly legal XML?
Now, if I receive the same entity body with a different media type:
Content-type: application/octet-stream
Entity-body: <a><b/></a>
I cannot conclude anything about elements. The resemblance to HTML or
even Unicode characters may be coincidental (if unlikely). All I can
conclude is that I've received a sequence of bits, with some suggestion
that they be treated in groups of 8. Again, nothing in the pertinent
specifications tells me what the prescribed handling is. A browser user
agent retrieving this pair may have some conventions, perhaps to offer to
save a file, but another user agent might quite reasonably do something
else or declare an application-level error.
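To make the contrast between the first two cases concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function name interpret and the shape of its return value are hypothetical, the standard library's xml.etree.ElementTree stands in for "an XML processor", and the body used is a well-formed document with 'b' nested in 'a'. The function reports what may be concluded from the pair and deliberately does nothing else with it.

```python
import xml.etree.ElementTree as ET


def interpret(media_type: str, body: bytes) -> dict:
    """Return what may be concluded from a (media type, octets) pair.

    Hypothetical helper: it reports conclusions only and does not
    decide what an application should do with them.
    """
    if media_type == "application/xml":
        root = ET.fromstring(body)  # raises ET.ParseError if not well-formed
        # Record each element's name with its parent's name (None for the root).
        nesting = [(root.tag, None)]
        for parent in root.iter():
            for child in parent:
                nesting.append((child.tag, parent.tag))
        return {"kind": "xml", "elements": nesting}
    # For application/octet-stream (and any type this sketch does not know),
    # the only conclusion drawn is that some number of octets arrived.
    return {"kind": "octets", "length": len(body)}


body = b"<a><b/></a>"
print(interpret("application/xml", body))
print(interpret("application/octet-stream", body))
```

The same octets yield two elements under one media type and nothing but a length under the other; whether to display, store, or reject either result is left entirely to the caller.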
A third case:
Content-type: application/xml
Entity-body: <a></b>
Here we can conclude that the data received is not legal per the
applicable specifications. What to do about that, though, is not (I
think) specified by the XML Recommendation, which is referred to by RFC
3023, which is referred to indirectly by the HTTP specification (RFC
2616). So, >prescribed handling< is again not given; just the conclusion
that the data is not legal per the specs. Since the data is not legal,
anything a user agent might do to help you recover, such as pointing out
where the tags don't match, is beyond >this layer< of the specifications.
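As a rough illustration of "just the conclusion", the following Python sketch checks the third case and reports a fact without acting on it. The name well_formedness is hypothetical, and the standard library parser is used only as a convenient stand-in for the XML Recommendation's notion of well-formedness.

```python
import xml.etree.ElementTree as ET


def well_formedness(body: bytes):
    """Report whether body is well-formed XML and, if not, where parsing failed.

    Returns (True, None) or (False, (line, column)). Nothing here decides
    how a caller should react to either outcome.
    """
    try:
        ET.fromstring(body)
    except ET.ParseError as err:
        return False, err.position
    return True, None


print(well_formedness(b"<a></b>"))
print(well_formedness(b"<a><b/></a>"))
```

Pointing the user at the reported position, or doing anything else with it, would belong to a separate layer.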
It's sort of like the C Language Reference and the specification you would
write for Lint. Both are useful specifications, but it's a good thing
that they are separate. You can imagine lots of lint-like tools, with
different behavior, that would help different communities of C users deal
with various potential problems in their (purported) C code.
The same is true for XML, I think. Your data is either legal XML or it
isn't; that's not a statement about processing, it's just a fact. I
choose to think that what I want to do about illegal XML depends on the
circumstance. For mission-critical applications of XML as a data format,
I surely want to decline to process the data I've received, but I might
want to run some tools that help me isolate the errors. For less critical
applications I might want to do what XML5 advocates seem to favor, i.e.
fix up the input as best I can and proceed. I don't think that the
documentation for any one of those failure or recovery strategies should
be inexplicably bound to the specification for the language syntax and its
interpretation. Indeed, I think the XML Recommendation goes just a bit
too far. The language spec should say: "here's what's legal XML, and
here's what you can extract from legal XML". Full stop. Specifications
for pieces of software that deal with data purported to be XML are also
important, but should be separate, IMO. So, XML5 may be useful as a
specification for data that some applications may want to process, but
XML5 should then not be seen as a replacement for XML itself. It should
be seen as a superset to be used with care in places (if any) where it's
perceived to be a net win. Whether the community is on balance well
served by having such an XML5 specification, I'm unconvinced, but there
are good arguments on both sides I think.
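One way to picture the separation argued for here is the Python sketch below, in which the language definition is shared and each application layers its own policy on top. Everything named in it (strict, diagnose, POLICIES) is hypothetical, and the standard library parser again stands in for "legal XML". A fix-up policy in the XML5 style is deliberately omitted, since its recovery rules are not described in this note and the standard library offers no such parser.

```python
import xml.etree.ElementTree as ET


def strict(body: bytes):
    """Mission-critical policy: refuse anything that is not well-formed."""
    return ET.fromstring(body)  # ET.ParseError propagates to the caller


def diagnose(body: bytes) -> str:
    """Lint-like policy: never process the data, only describe the problem."""
    try:
        ET.fromstring(body)
    except ET.ParseError as err:
        line, column = err.position
        return f"not well-formed at line {line}, column {column}: {err}"
    return "well-formed"


# Hypothetical registry: each application picks a policy, while the notion
# of what is legal (here, whatever the parser accepts) stays the same.
POLICIES = {"strict": strict, "diagnose": diagnose}

print(POLICIES["diagnose"](b"<a></b>"))
print(POLICIES["strict"](b"<a><b/></a>").tag)
```

Adding or swapping a policy changes nothing about which inputs count as legal, which is the property the separate packaging is meant to preserve.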
Anyway, I've gone into some detail and probably run on too long, but I'm
really only trying to make one point: the specification of correct
interpretation is not the same as the specification for prescribed
handling. I believe that HTTP and the specifications to which it
delegates do mostly the former in discussing Content-type and Entity-body.
HTML 5 does both. As I've stated before, I would prefer if those two
sides of the HTML 5 specification were packaged separately, to the extent
practical. Roughly that would be: one document describing legal HTML 5
and its correct interpretation (in the sense above); the other would be a
specification for what we might call a "full function browser", and that
would be where the fixups for the error cases would be documented. I do
acknowledge that the tight integration of scripting into the browser's HTML
handling greatly complicates this story. I'm not yet convinced that
something like XML5 will on balance be beneficial, but perhaps it would
bring value for certain less critical applications of XML.
Noah
[1] http://www.w3.org/2001/tag/doc/selfDescribingDocuments.html#grounding
Received on Thursday, 5 March 2009 19:46:56 UTC