Re: Error handling in XML

At 22:35 18/04/97 -0700, Tim Bray wrote:
>In recent discussions, some but not all at the recent WWW6 conference, it has 
>become apparent that we have an opportunity, if we act now, to avoid one of 
>the big problems that has caused HTML a lot of grief.  This is the area of 
>error-handling.  HTML doesn't have any.  As a result, the browser and tool 
>vendors are stuck on an endless treadmill of trying to enhance the system 
>while at the same time handling any and all collections of bytes that Netscape 
>1.X did.  Get a couple of beers into anyone from the big N or the big M and 
>you'll see some real tears over this.  In my former life as a Web indexer,
>I cried some of those tears myself.  So let's not let it happen again.

Agreed, so far.

>The subject is violations of well-formedness.  Well-formedness should be easy 
>for a document to attain.  In XML, documents will carry a heavy load of 
>semantics formatting, attached to elements and attributes, probably with 
>significant amounts of indirection.  Can any application hope to 
>accomplish meaningful work in this mode if the document does not even manage 
>to be well-formed!?!?

It may very well be able to.   For example, most valid SGML documents are
not well-formed XML documents, but an XML parser may well be able to handle
some of the constructs that are legal SGML but not legal XML.  Consider
unquoted attribute values: an XML processor can easily recover from these.

>I suggest that we add language to section 5, "conformance", which says:
>
> "An XML processor which encounters a violation of the constraints
>  of well-formedness must not thereafter pass any information about
>  text or markup to the application.  It must pass to the application
>  a notification of the first such violation encountered.

I agree with the requirement that a processor must notify the application of
non well-formedness, but requiring that the processor not process anything
thereafter is totally unreasonable.  As an implementor my reaction to that
is: No way! I would not be willing to implement such behaviour whatever the
spec says.

Such a requirement is much more than is needed to avoid the sort of problems
that have occurred with HTML.  It is enough if applications are required to
inform users that a document is invalid (and provide enough information that
they can fix it).  Serious SGML parsers (SP, Omnimark) make a big effort to
continue producing useful output after the first error.  Does that mean SGML
has had the same problems with error handling as HTML? Of course not.

The XML spec isn't perfect: it has bugs and things that aren't clear.  XML
parsers aren't perfect either.  The result is there will be cases where one
implementation thinks a document is well-formed, but another doesn't.  The
requirement you suggest would be disastrous in this situation.

>  It MAY 
>  thereafter, at user option, pass to the application information
>  about well-formedness violations encountered after the first."
>
>[or in English: you gotta tell the app about the first syntax botch you hit; 
> you're allowed to send the app more error messages, but you're not allowed 
> to send anything but error messages after you've detected an error]
>
>If we wanted to avoid phrasing this in terms of the actions of a processor 
>(which might be a good idea in general for the spec) we could redefine 
>"markup" and "character data" in such a way that they are considered not 
>to exist in a document which is not well-formed.
>
>Some might argue that this violates the Internet creed: "Be conservative in 
>what you supply, and liberal in what you accept."

It certainly does violate it.

>  I can live with that: 
>the consequences of the second half of that creed have led to intolerable 
>results in the quality and usability of the data on the Net.

Not if you require that applications tell users about it.

James

Received on Saturday, 19 April 1997 03:25:17 UTC