Re: Error handling in XML
Wow, this topic has generated some real significant comments, thank you!
Attached are my reactions to several of the comments. I have not made
up my mind on the exact phases that Tim proposes. I have identified on
what I think are the important goals/issues.
1) Positioning of XML re SGML and HTML
Error handling may be the most important detail that we have yet
addressed in positioning of XML re SGML and HTML:
- is it really important to court Joe/Sally Homepage?
- is it really important to court SGML users (eg. corp tech docs)?
- "HTML is great because you can send people broken documents"
- "XML is great because you can't send people broken documents"
Re HTML: I believe that we need to differentiate from HTML.
Structure and domain specific markup is one way. I believe
error handling should be another. I can give up Joe/Sally Homepage
What do we describe as benefits of XML? Can there be benefits if a
significant portion of the XML corpus is not well formed?
Re SGML: it is a real pain to see authors grope with error reports
from batch SGML processors. However, it is not likely that XML will
be processed in batch.
Conformance in SGML is difficult because the spec it self is difficult
and subject to interpretation. XML is much simpler and should have
far fewer "gray" areas.
If someone from corp tech docs decides to go with XML and make the
changes that are required, I would not expect them to give up
precision and expectations of interchange of only well formed
2) Vendor support.
We need to give XML vendors a device that helps them do the
right thing. While we can not prevent vendors from being
non-conformant, we can help them if their desire is to conform.
Specifications can really help during meetings between industry
Clearly, if we do not do anything then conformance will be defined
as the error recovery behavior of a market leading application.
This is not constructive.
How can we help make sure that an author can do the right thing
(a question I am often asked re HTML and there are no easy answers)?
How do we let authors know that publishing this document instance
will embarrass them?
4) Why are we only focused on the client?
The server is a better place to do validation (sorry: WF test)
checks. True, this is not how things are done today but, so what.
This is an "Internet" oriented effort so lets think about the
We are making SGML for the Internet which is not always 100% reliable.
Should not clients be able to provide some value to their users
(human or robot) when there was a channel error?
Summary; I can accept the forcefulness of Tim's proposal but hope that
it an be restructured.
> In recent discussions, some but not all at the recent WWW6 conference, it has
> become apparent that we have an opportunity, if we act now, to avoid one of
> the big problems that has caused HTML a lot of grief. This is the area of
> error-handling. HTML doesn't have any. As a result, the browser and tool
Not true. There is a well stated position that says that a user agent should
try and recover as best able.
> vendors are stuck on an endless treadmill of trying to enhance the system
> while at the same time handling any and all collections of bytes that
> 1.X did. Get a couple of beers into anyone from the big N or the big M and
> you'll see some real tears over this. In my former life as a Web indexer,
> I cried some of those tears myself. So let's not let it happen again.
In retrospect, I suspect that the accept all position may be the hardest
compromise the web had to make. I believe it made the right choice for
> The subject is violations of well-formedness. Well-formedness should be easy
> for a document to attain. In XML, documents will carry a heavy load of
> semantics and formatting, attached to elements and attributes, probably with
> significant amounts of indirection. Can any application hope to
> accomplish meaningful work in this mode if the document does not even manage
> to be well-formed!?!?
What is meaningful? It seems to me this phrase is the key difference between
XML now and HTML past/now.
> I suggest that we add language to section 5, "conformance", which says:
> "An XML processor which encounters a violation of the constraints
> of well-formedness must not thereafter pass any information about
> text or markup to the application. It must pass to the application
> a notification of the first such violation encountered. It MAY
> thereafter, at user option, pass to the application information
> about well-formedness violations encountered after the first."
This makes it hard for editors to be conforming.
> [or in English: you gotta tell the app about the first syntax botch you hit;
> you're allowed to send the app more error messages, but you're not allowed
> to send anything but error messages after you've detected an error]
> If we wanted to avoid phrasing this in terms of the actions of a processor
> (which might be a good idea in general for the spec) we could redefine
> "markup" and "character data" in such a way that they are considered not
> to exist in a document which is not well-formed.
> Some might argue that this violates the Internet creed: "Be conservative in
> what you supply, and liberal in what you accept." I can live with that:
> the consequences of the second half of that creed have led to intolerable
> results in the quality and usability of the data on the Net. Furthermore,
> if you want to serve up ill-formed dogshit, this will presumably remain
> possible, because: "text/html means never having to say you're sorry."
A real quoteable: "text/html means never having to say you're sorry."
that could easly backfire. Possible paraphrases: "You will be sorry if
you want to do more than HTML can" or "Using XML will mean that you
will be sorry."
> Cheers, Tim Bray
> email@example.com http://www.textuality.com/ +1-604-708-9592
> > I can live with that:
> >the consequences of the second half of that creed have led to intolerable
> >results in the quality and usability of the data on the Net.
> Not if you require that applications tell users about it.
The problem is, a browser maker can not impose that sort of error
warning on the "surfer". They can not make the changes to the document
and recieve no immediate benefit from seeing the error message.
How can we describe the goal of this requirement?
"to prevent people from shipping any document that 'displays'
regardless of its structure and syntax"
advoid a corpus of documents that are in-situ valid based on
error recover algorithims.
(in-situ valid := useable in the app current context)
another goal: we do not want people as confused by xml error messages are
authors are when confronted with error messages from "most"
sgml systems. I have seen it happen and the author's reaction
is never pretty and seldom repeatable in mixed company.
> <AXIOM NEGOTIABILITY="epsilon">
> My basic tenet is that an XML document is either WF or it isn't. If there is
> one error, then the result is a null document. If that isn't true then I
> think we lose a large number of people who see XML as a robust and reliable
> way of passing information.
Good axiom to my thinking.
> It's critical that XML tools are totally interoperable.
> If one tool passes an invalid document to a second tool and the second tool
> doesn't know it's invalid, then some people's worlds start falling apart.
> For this we need tools we can refer to like sgmls. We need 'gold-standard'
> tools that we all agree 'get it right'. So, for example, no one should
> a parser that doesn't give the same output as <the standard in the community>,
> whatever that turns out to be. Same for links, styles and the rest of it.
Nice idea, who has the deep pockets and can be assured to continue to pursue
new examples etc?
> Peter Murray-Rust
> So what I would like to focus this disucssion on is a taxonomy of
> error events. If we can come up with a list of errors, the class
> they belong to, and possible ways to handle these errors in terms
> of actions of the application, I guess many people would be happy.
> Best regards,
> Norbert H. Mikula
I agree with the rest of your discussion but regarding error taxonomy,
the probability of listing all error cases is close to nil.
> Before we expend a large amount of effort on this, can someone confirm
> that it stands some chance of being listened to. Forgive my scepticism
> (and as I wasn't at WWW6 I didn't have the chance to hear it from the
> horses' mouths): I know both N and M contain numbers of people who are
> seriously committed to making a better shot at it this time round, but ...
I think we need to do this regardless; however, if the vendors have
suggestions then we should take them into acount since the may significantly
improve the overall results.
> just worried it's not going to wash with the people who have to market the
> product (any marketeers out there?).
Yes. This question is perhaps the most significant in establishing XML's
position relative to HTML and SGML.
> Ah, you mean like <PLAINTEXT> ? :-)
Hummmm...maybe a non-binary requirement is better than simply saying
pass data or not.
> human->editor/validator->XML(WWW)->parser/browser+stylesheet [->printer]
> program ->XML(WWW)->parser(->robot->program)+
> Peter Murray-Rust, domestic net connection
The problem is, today at least, most people write for the first case
and others try to apply the second. What motivation can we have for
the first case?
> For example, (due to the evil "selective ack" design problem in TCP/IP)
> large documents have a high chance of failing in transfers to faraway
> places like here (Australia). So we would would probably look more dimly
> on higher-layer protocols compounding the problem by not even making
> available as much as you managed to get. Certainly for this applies to
> readable documents.
Again, a real good point. We are making SGML for the Internet which is
not always 100% reliable.