Error handling: yes, I did mean it

To summarize: I proposed that XML processors be required to stop
passing data (other than error notifications) to applications after the
first violation of well-formedness.  Lots of people disagree.  I will
respond to some detail points below, but let me amplify a bit.

Some have expressed doubt that the big M and N would really go for
this behavior.  M is on the list, N soon will be, they can speak for
themselves, but I am representing what I understand to be their 
position.  In the following list, perhaps #4 is the most important.

1. HTML only exists to be displayed.  A botched display is, well,
   unfortunate.  XML would not exist if it did not presuppose sophisticated
   processing by the recipient.  It seems obvious that such processing
   cannot be trusted if the document fails to meet even basic standards
   of well-formedness.  Java is fragile enough as it is without
   trying to build in botched-tag-recovery heuristics.
2. XML docs are expected to support XML Xpointers.  Is it sane to
   try to interpret something like 
   HREF="#HERE,DESCENDENT(1,FOO,BAR,23)" in a non-well-formed doc?
3. XML will be used to support EDI, big-time.  The OFX format behind
   Money/Quicken is about to become XML, it seems.  Obviously, a 
   violation in this case will/should/must lead to instant abort of
   the transaction.
4. The vendors and *serious* information providers at one in wanting
   to create a non-HTML-like culture of publish-it-right on the Net; one
   way to do this is to shout, loudly, that there are a few (simple,
   thank goodness) rules, and they must be obeyed.
5. Clearly, authoring systems and the like must be able to deal,
   transiently, with non-well-formed documents.  Is the input subsystem
   of an editor required to be an "XML processor"?  Why should it?
6. Clearly, it is often the case that XML processors, having encountered
   errors, should in some cases proceed and report on all the following
   errors that can be detected.  It does *not* therefore follow that they
   should merrily go on handing data and markup (of dubious reliability)
   to the application.
7. The vendors want off the "bugwards-compatible" treadmill. We should
   let them and in fact help them.
8. HTML still exists as an escape hatch for those who only care that
   their stuff looks decent on most screens most times.

Terry Allen wrote:
>Oh come on.  That would make it be a violation of conformance to output 
>information about invalid portions of the instance aside from its
>initial fault

Wrong.  It would be legal to go on issuing error messages.

> including indexing of what text they contain (malformed
>or not) and further malformations.  


>Real world demands will require
>that processors not be so dainty.

This is the key point.  My perception is that this is exactly what
the vendors and information providers want.

>Who's going to play cop?  And if this were useful, wouldn't it be 
>useful only if all parsers encountered the same error first?

No, because they're allowed to report all the errors they find.
>[if you're] ... trying to avoid anyone making any effort at error 
>recovery, please give up now, immediately.

That's exactly what I'm trying.  Do you expect malformed PostScript to
produce a valid page, or a birth-date of September 48th to be accepted
by a payroll system?

Another example.  It is very difficult indeed to devise an input
stream that will cause [gtn]roff to complain.  It is very difficult to
hand-author any significant amount of PostScript and not have the
first few drafts thrown out with syntax errors.  Both are successful.

>| the consequences of the second half of that creed have led to intolerable 
>| results in the quality and usability of the data on the Net.  
>Name three such results.

1. David Siegel.  
2. Overlapping FORMs and TABLEs.  
4. Javascript and CSS hidden in comments
5. Internet explorer uses 15 megabytes of memory

James Clark writes:

>>Can any application hope to 
>>accomplish meaningful work in this mode if the document does not even
>>to be well-formed!?!?
>It may very well be able to.   For example, most valid SGML documents are
>not well-formed XML documents, but an XML parser may well be able to handle
>some of the constructs that are legal SGML but not legal XML.  

That's a good point; I could see an allowed special-case
for *valid* SGML documents.

>unquoted attribute values: an XML processor can easily recover from these.

I disagree.  The reliable presence of quotes makes XML attribute processing
dramatically easier than SGML's.  Furthermore, 
 <foo attr="val>some more text &amp; markup... 20k before the
 next quotation-mark
will send most parsers (including at least SP and Lark) into gibber-mode.
I think an unquoted attribute value is *precisely* the kind of thing
that should remove any expectation that a document can be processed
reliably by a receiving application, and which vendors should *not*
be asked to heuristic (v?) their way around.

> requiring that the processor not process anything
>thereafter is totally unreasonable.  As an implementor my reaction to that
>is: No way! I would not be willing to implement such behaviour whatever the
>spec says.

The processor is not forbidden to proceed, looking for more errors (one
of the things that SP is really good at) - it is just forbidden to
pretend that everything is OK and go on passing bogus data to the
application.  You would be within your rights as an implementor to
ignore this, but I, as an application builder, would not assign any
significant client processing role to a component that insisted on
trying to guess the meaning of egregiously broken data.

>The XML spec isn't perfect: it has bugs and things that aren't clear.  XML
>parsers aren't perfect either.  The result is there will be cases where one
>implementation thinks a document is well-formed, but another doesn't.  The
>requirement you suggest would be disastrous in this situation.

That is a good and troubling point.  On the other hand, for the
basics - proper tag/attribute syntax, nesting, and so on - it seems
pretty likely that well-formedness should be *very* unambiguous.
And on balance I am less troubled by the possibility of disagreement
between processors, than by another another I-can-process-buggier-
documents-than-you rat-race, or the possibility of downloading
malformedness-recovery heuristics in every Java applet.

Norbert Mikula writes:
>What about an XML editor ? Of course, I might be in a situation where
>I am, temporarily, in a state where I have no well-formdness.

Right, I agree; although an XML editor that produces valid output
probably cannot use a pure XML processor, in the sense the spec
uses, to read its input anyhow.  The whole editing/authoring
system, seen globally from the outside, probably qualifies as
an XML processor, in that it will not emit documents that are
not well-formed. [or I won't use it]

>Furthermore error messages only make sense if you know the context
>of them. I want to show my user the output of the parser that
>shows where exactly the error occured. You can provide the line-number,
>of course, but you would have to go back to the original document,
>load it non-parsed, so that you can exactly show the user where
>the error occured.

Exactly.  This is exactly what Emacs+SP does, and it seems like
the correct behavior to me.

>So what I would like to focus this disucssion on is a taxonomy of
>error events. If we can come up with a list of errors, the class
>they belong to, and possible ways to handle these errors in terms
>of actions of the application, I guess many people would be happy.

That would make sense, and I could go for that.  I am suggesting a
simpler solution, in the spirit of XML.

Sean McGrath:

>I think it is hugely important to say "this is
>not well formed"
>but overly spartan to then say "so fix it and try again!".

Yes, I agree; emit all the helpful error messages, the more the
better - just don't go on, at the same time, passing data to the
application as though nothing's wrong.

Peter Murray-Rust:

>My basic tenet is that an XML document is either WF or it isn't.  If there 
>is one error, then the result is a null document.  If that isn't true then I 
>think we lose a large number of people who see XML as a robust and 
>reliable way of passing information.  

I wouldn't go that far... I'd say "a null document plus error messages".
To make things simple, I wouldn't even say "a null document", because
that would forbid the processor from giving the app *anything* until
he'd read the whole thing, to make sure it's WF.

Peter Flynn:

>[M & N] have no-one but themselves to blame for this. The facts of the
>situation were explained to them ad nauseam in person and over the
>wires long before the Web passed the point of inflection on its growth 
>Before we expend a large amount of effort on this, can someone confirm
>that it stands some chance of being listened to. 

Yes; and they have learned by experience.  I wouldn't be advancing
this is I wasn't PRETTY DAMN SURE that this is exactly what these
people want.

I dunno, the forgiving behavior of N&M may in the big picture have
been the right thing to do; perhaps an important part of the 
success of the Web.  I.e. we may have a terrible hangover now,
but we had fun last night so to speak.

>Joe and Jill Homepage are not likely to give the proverbial tinker's cuss
>whether their documents are well-formed or not, if the browsers are as
>forgiving and tolerant as N or M. The browsers are going to have to offer
>significant new features to compensate for the penalty of having the 
>parser gag on invalid syntax. 

Joe & Jill should go on using HTML.  I myself will probably go on using
HTML for all sorts of quick-post things.  The significant new features
(most of which involve smarts on the client to process richer data)
require basically sound data; seems like a good trade-off to me.

Sean McGrath again:
>I was simply making the point that the likes of
>    nsgmls foo.sgm | grep -c "^(BAR$"
>can be a useful thing to do even if foo.sgm markup contains errors.

Sure; but if you replace "grep -c" with some fancy java applet that
does a business-critical application, this is no longer useful but
highly dangerous.