RE: Error handling: yes, I did mean it

In short, we at Microsoft (I have discussed this with many people)
think that we must make a fresh start with this great new format
and define the error handling strategy for XML very precisely.

XML is a new open format and writing an XML parser is easy and should
stay easy.

This is why my position on parsing XML is:
"Error handling in XML should obey public rules and not some
random heuristics".
(I know it seems unbelievable to have to state such things, but hey,
we know what happens on the WWW.)

The spec already says that all reportable errors should be reported
(under control of a user-settable option). I think vendors and
implementors should compete on the quality of error reporting.

This is great.

But I think it is *essential* to define what happens after the error(s)
are reported.

I see 2 possibilities:

First possibility:
The processor stops and does not build any internal data structure
corresponding to the fragment which contains errors. The net result
is that the erroneous fragment of an XML document is not usable.
[Tim Bray's proposal]
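
To make the first possibility concrete, here is a minimal sketch in
Python. It is an illustration only, not anything taken from the XML
draft: it assumes a toy syntax with bare tags (no attributes, comments
or entities), and the names parse_strict and WellFormednessError are
invented for this example. On the first error it raises, and the
partially built structure is never handed to the caller.

```python
import re

# Toy tokenizer (an assumption of this sketch): bare tags such as <a> or </a>,
# or a run of character data. Real XML syntax is much richer.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*>|([^<]+)")


class WellFormednessError(Exception):
    """Raised at the first error; nothing built so far is returned."""


def parse_strict(text):
    """Return a list of (name, children) nodes, or raise on the first error."""
    root = ("#document", [])
    stack = [root]
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise WellFormednessError(f"bad markup at offset {pos}")
        closing, name, chars = m.groups()
        if chars is not None:
            if chars.strip():
                stack[-1][1].append(chars.strip())
        elif not closing:
            node = (name, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif len(stack) > 1 and stack[-1][0] == name:
            stack.pop()
        else:
            raise WellFormednessError(f"unexpected </{name}> at offset {pos}")
        pos = m.end()
    if len(stack) > 1:
        raise WellFormednessError(f"<{stack[-1][0]}> not closed at end of input")
    return root[1]
```

With this sketch, parse_strict("<a><b>xxxx</b></a>") returns a tree,
while parse_strict("<a><b>xxxx </a>") raises and returns nothing.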

Second possibility:
If there is strong agreement that the XML syntax is too rigid, let us
change the XML syntax. This is what I understand when I hear people
complaining about things like:
"<a><b>xxxx </a>"
by saying "it is obvious that this means <a><b>xxxx</b></a>".
So technically, this is not error recovery.

For example, we can state that if:
1/ A tag is not closed
   <a> <b> xxxx </a>
   <b> is automatically closed before <a> is closed
2/ A tag is not closed and we hit EOF
   <a> <b> xxxx </b>EOF
   <a> is automatically closed
3/ Extra end tags
   <a><b> xxxx </b> </c> </a>
   </c> is skipped
4/ Etc etc .....  you see the problem here. It seems to me that it is
very difficult to propose easy rules. But I am open to any suggestion.
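
Rules 1/ to 3/ above can be sketched as follows. This is only one
possible reading of them, written in Python for illustration: the
function name repair and the toy tag syntax (bare tags, no attributes)
are assumptions of this sketch, and anything that is neither a simple
tag nor text is silently dropped.

```python
import re

# Toy tokenizer (an assumption of this sketch): bare tags or character data.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*>|([^<]+)")


def repair(text):
    """Apply rules 1/ to 3/; return (repaired markup, notes on what was done)."""
    out, notes, stack = [], [], []
    # finditer skips input that matches neither alternative (e.g. a stray "<").
    for m in TOKEN.finditer(text):
        closing, name, chars = m.groups()
        if chars is not None:
            out.append(chars)
        elif not closing:
            stack.append(name)
            out.append(f"<{name}>")
        elif name in stack:
            # Rule 1: close every still-open inner element first.
            while stack[-1] != name:
                inner = stack.pop()
                out.append(f"</{inner}>")
                notes.append(f"rule 1: closed <{inner}> before </{name}>")
            stack.pop()
            out.append(f"</{name}>")
        else:
            # Rule 3: an end tag with no matching open element is skipped.
            notes.append(f"rule 3: skipped </{name}>")
    # Rule 2: close whatever is still open at end of input.
    while stack:
        inner = stack.pop()
        out.append(f"</{inner}>")
        notes.append(f"rule 2: closed <{inner}> at end of input")
    return "".join(out), notes
```

Even in this small sketch the rules interact: for "<a><b>x</a></b>",
rule 1 closes <b> early and rule 3 then skips the real </b>. A
different ordering of the rules would be just as defensible, which is
the difficulty described in 4/.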
 
In short, I think that it is *essential* to state, from the
beginning, what happens after the error(s) are reported.
I do not buy Sperberg-McQueen's claim that the problem in the HTML
browsers was not error recovery but the lack of error messages. Error
messages can always be turned off.
Error messages imply a UI, but there are tons of XML applications which
do NOT have a UI (CDF, database applications). Error messages can occur
in the middle of complex scripts, and sometimes the code can simply
skip them.
When you describe mission-critical information, as in financial
applications, you do not expect any error recovery. When you write
Java or C code, you do not expect error recovery. I also think that we
must give a sign, a direction, to the web community by being hardcore
from the beginning.

If we do not do that, incorrect documents are going to be published,
and they will stay on the web because some tools will be able to
display or process them without requiring their authors to fix them.
This means that heuristics are going to be used, because users will
discover them as tool providers invent them.
Sounds familiar. I hope we will not go this way for XML.

-Jean 


> ----------
> From: 	Michael Sperberg-McQueen[SMTP:U35395@UICVM.UIC.EDU]
> Sent: 	Saturday, April 26, 1997 4:15 PM
> To: 	W3C SGML Working Group
> Subject: 	Re: Error handling: yes, I did mean it
> 
> Summary:  we cannot in practice require that XML processors ignore or
> discard data following the first detected error; as a result we should
> not try to do so, even if doing so were a good idea (which it is not).
> 
> Tim has suggested, and a number of people have supported the idea,
> that after detecting a violation of a well-formedness constraint an
> XML processor be required to stop sending information to the
> downstream application.  A number of people have already argued
> against this idea, using arguments I agree with and won't repeat.
> Here I just want to point out that (with a single exception) neither
> Tim nor anyone else has made any argument that, even if taken at face
> value, would lead to the conclusion that this is a good idea.
> 
> The sole exception is Tim's rhetorical question "can any application
> hope to do anything useful with ill-formed data?" to which the only
> realistic answer is 'Yes, of course, many applications do hope to do
> useful things with ill-formed data, and some of them are right.' James
> gave this answer, and no one has attempted any refutation, so I won't
> say any more about it.
> 
> Otherwise, the arguments of the Draconian camp are all centered around
> the unquestioned observations that
>   - there are applications where ill-formed data is useless or worse
>     than useless, and where ill-formedness must be detected
>   - by their unwillingness to issue error messages, and their
>     determination to provide attractive displays even of badly
>     ill-formed documents, HTML browser makers have made their own
>     lives very difficult
> 
> Neither of these observations supports a blanket ban on error recovery
> by XML processors.
> 
> Tim and others have, in the meantime, conceded that some applications
> can usefully attempt error recovery, and hope to salvage the Draconian
> Rule by suggesting that such applications should use programs which
> aren't 'XML processors' in the strict sense.  This amounts to saying
> "implementors can pick and choose which parts of XML to implement, and
> can keep themselves blameless even when flouting basic requirements of
> the spec, if only they call themselves XML Handlers or some other name
> instead of 'XML processors'".  I cannot think of a worse approach to
> the problem of ensuring uniform error reporting by XML software.
> 
> Whether it's possible to prevent vendors from attempting to compete on
> the basis of the quality of their error recovery, I don't know.  I
> doubt it.  I also don't see why it's necessary to prevent it.  It's
> not the error recovery in HTML browsers that has led HTML to its
> current state, but the *silence* of that error recovery.  We complain
> that most authors validate by looking at the document onscreen -- what
> else do we want?  I do that myself, in SGML.  Yes, I do check the
> return code from the parser, but I also check to see that everything
> looks all right -- if it doesn't, the validity of the document is
> deceptively hiding errors in the tagging or in the style sheet.  The
> only thing wrong with checking by visual inspection is that in most
> HTML browsers it's not a sufficient check.  An author who does want to
> find errors can't do so with the software at hand, because the browser
> won't report them.
> 
> So I agree with whoever it was who said that the real problem is the
> absence of an error-reporting mode in HTML browsers.
> 
> If this is true, then what we need to do is to ensure that XML
> processors *always* allow the user to request error reports, even if
> the software recovers from the errors in question.  That way, the user
> who says "program X displays my data all right, why don't you?" can be
> told "look, even program X says your document is ill-formed: look at
> it with error-checking turned on!"
> 
> As it happens, the xml-lang spec already requires this.  I don't think
> it can realistically or usefully require more, except perhaps that it
> should also explicitly require that the processor notify any
> down-stream app, as well as the 'end' user if any.  I don't think it
> should require less.
> 
> If we want the culture of XML usage to differ from that of HTML, we
> need to ensure that implementors pay attention to the requirement that
> they report reportable errors unless the user says not to.  We can do
> that by complaining unmercifully about any implementation that doesn't
> provide error reporting, and by pointing out -- correctly -- that it's
> not a conforming implementation of XML.
> 
> -C. M. Sperberg-McQueen
> 

Received on Tuesday, 29 April 1997 23:29:11 UTC