Betting our lives on error handling from lee@sq.com on 1997-05-09 (w3c-sgml-wg@w3.org from May 1997)

From: <lee@sq.com>
Date: Thu, 8 May 97 23:03:11 EDT
To: w3c-sgml-wg@w3.org
Message-Id: <9705090303.AA26604@sqrex.sq.com>
[search for "Proposal" if you are in a hurry]

> But are you willing to bet me your life that the average parser writer
> will correctly guess which well-formed string from among those given

No.  But the application may be able to, given enough information.
I have in the past (a long time ago!) used versions of YACC that did
optional error recovery by inserting or deleting symbols.  A C compiler
that used this technology did not generate code if there were errors,
but it _did_ give much better second and subsequent error messages.

Most modern C compilers do this sort of error recovery,
now forbidden in XML.
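
For concreteness, here is a toy sketch of recovery by inserting or deleting symbols. It is not how YACC or any particular compiler does it; the grammar (balanced parentheses) and the names Recovery, check_parens and may_generate_output are made up for illustration. A stray ')' is "deleted" (skipped) and a missing ')' is "inserted" at end of input, so every problem gets counted, yet output is still refused if anything was repaired.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of the repairs made while scanning. */
typedef struct {
    int deleted;   /* stray ')' symbols that were skipped */
    int inserted;  /* missing ')' symbols assumed at end of input */
} Recovery;

/* Scan the whole string, repairing as we go instead of stopping
 * at the first problem. */
Recovery check_parens(const char *s)
{
    Recovery r = {0, 0};
    int depth = 0;
    size_t i;

    assert(s != NULL);
    for (i = 0; s[i] != '\0'; i++) {
        if (s[i] == '(') {
            depth++;
        } else if (s[i] == ')') {
            if (depth == 0)
                r.deleted++;      /* recover by deleting the symbol */
            else
                depth--;
        }
    }
    r.inserted = depth;           /* recover by inserting what is missing */
    return r;
}

/* Like the C compiler described above: recover in order to keep
 * reporting, but generate nothing if there were any errors. */
int may_generate_output(Recovery r)
{
    return r.deleted == 0 && r.inserted == 0;
}
```

Calling check_parens("a)(b(c") records one deletion and two insertions, and may_generate_output() then returns 0.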

And if the parser writer is not in fact "average" but "excellent" or
"experienced", do you still want to forbid that person from using XML
in environments where fatal errors are not the correct approach?

If I write an XML parser in C that says
    if (foundAnError) {
        /* sneer at the user */
        Error("You dildo!  You gave me a bad file!\n");
        exit(1);
    }
now
(1) I have written (as I understand it) a conformant XML parser, and
    am correctly passing to the user the first error (which is that
    the user is stupid) and then exiting;

(2) the application using the parser also exits at this point -- if it's
    an editor, no chance to save work;

(3) the user probably doesn't like this program.  I know _I_ wouldn't.
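
A minimal sketch of the alternative, assuming a parser that hands each error to the application and lets it decide whether to go on. All the names here (ParseAction, ErrorHandler, ParseResult, scan, stop_at_first, keep_going) are invented for the example, and the "parsing" is only a toy check that '<' and '>' alternate; it is nowhere near a real XML parser. The point is just that the parser returns to its caller rather than calling exit(), so an editor still gets its chance to save work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; not from any real XML library. */
typedef enum { PARSE_CONTINUE, PARSE_STOP } ParseAction;

typedef ParseAction (*ErrorHandler)(size_t offset, const char *msg, void *ctx);

typedef struct {
    int errors;     /* number of errors reported so far */
    int completed;  /* 1 if the scan reached the end of the input */
} ParseResult;

/* Toy check: every '<' must be closed by '>' before the next '<',
 * and a '>' may not appear outside a tag. */
ParseResult scan(const char *text, ErrorHandler on_error, void *ctx)
{
    ParseResult r = {0, 0};
    int in_tag = 0;
    size_t i;

    assert(text != NULL && on_error != NULL);
    for (i = 0; text[i] != '\0'; i++) {
        const char *msg = NULL;

        if (text[i] == '<') {
            if (in_tag)
                msg = "'<' inside a tag";
            in_tag = 1;
        } else if (text[i] == '>') {
            if (!in_tag)
                msg = "'>' outside a tag";
            in_tag = 0;
        }
        if (msg != NULL) {
            r.errors++;
            if (on_error(i, msg, ctx) == PARSE_STOP)
                return r;  /* back to the caller; the process keeps running */
        }
    }
    if (in_tag) {
        r.errors++;
        (void)on_error(i, "unterminated tag at end of input", ctx);
    }
    r.completed = 1;
    return r;
}

/* Two example policies an application might choose. */
ParseAction stop_at_first(size_t offset, const char *msg, void *ctx)
{
    (void)offset; (void)msg; (void)ctx;
    return PARSE_STOP;
}

ParseAction keep_going(size_t offset, const char *msg, void *ctx)
{
    (void)offset; (void)msg; (void)ctx;
    return PARSE_CONTINUE;
}
```

A small application can pass stop_at_first and get the strict behaviour; an editor can pass keep_going (or a handler that records each message) and look at the errors count afterwards to decide what to tell the user.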

Even SGML does not forbid error recovery.  In fact, SGML's behaviour
on incorrect input isn't defined in the standard.  This is why
Author/Editor can do error recovery and still be a conforming SGML System
(it says it is, on the splash screen, Eve!) -- and so can NSGMLS,
SPAM, Panorama, Omnimark and HoTMetaL (OK, HoTMetaL is an SGML
Application) (swap Application and System if I have them the wrong
way round, I can never remember the obscure terminology, and reading the
definitions in the standard didn't help me!).

SGML defines the concept of conforming (valid) documents.  A system
that works with conforming documents and says so is a conforming SGML
application [4.50, 4.51].  The standard says that it has to _require_
documents to be conforming.  Hence, a system that works with non-SGML
documents (however close those may happen to be to actual conforming
SGML documents) is not an SGML application, and can do what it likes.

But the standard in no way precludes a pre-processing phase that
takes a document and turns it into a conforming SGML document.
Hence, just because HoTMetaL can read Microsoft Word files doesn't
mean it isn't an SGML application (or system or whatever), even
though most Word files are not conforming SGML documents.

So I don't accept arguments based on ``this is what SGML does, we
need to make the web as robust as SGML'' because this simply isn't
true.  Existing SGML software often does error correction, and
proceeds past the first error.  It generally doesn't do it silently,
though.  I'd hate to see James have to take out the error handling
in NSGMLS which is so useful, for example.

On the other foot, neither does it make sense to _require_ any kind
of error correction.  You'd make it too hard to write XML parsers
for small applications.

Proposal:

So it seems clear to me that
(1) Implementers should be encouraged to report errors wherever it
    makes sense to do so.

(2) Validating Parsers must indicate whether a document is conforming
    or not both at the point of the first detected error that precludeth
    conformance, and also at the end of processing, should that be at
    some other juncture.

(3) No file or collection of files can be said to constitute an
    XML document if they are not in fact conformant.  They must
    be well formed, and, if a complete DTD is supplied, entirely valid.

(4) The XML specification should go no further than this.

Lee



> (and from the infinite number of other well-formed strings that could be
> transformed into the original string by interruptions in the
> transmission) was 'intended' by whatever created the original ill-formed
> example?
> 
> 
> -C. M. Sperberg-McQueen
>  ACH / ACL / ALLC Text Encoding Initiative
>  University of Illinois at Chicago
>  tei@uic.edu
> 
>
Received on Thursday, 8 May 1997 23:03:15 UTC