Re: Error handling in XML from Peter Murray-Rust on 1997-04-19 (w3c-sgml-wg@w3.org from April 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Sat, 19 Apr 1997 08:35:22 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <5796@ursus.demon.co.uk>
This is a very important subject and I think it's come at just the right time.
I am not a compiler expert, but I understand that this is not a trivial
problem to handle.  If we have interactive tools (e.g. an editor that is
'processing' (parsing) an XML document) then we need to have 
powerful error handling.

Although the spec refers to a 'processor' and an 'application', I have the 
strong feeling that it's natural and valuable to have more discrete components
in this.  At present we have 2 parsers which tackle a very well defined job - 
taking a document and validating it for well-formedness (and possibly validity) 
and transmitting some form of output from a WF document.  In my simple view they
are roughly analogous to sgmls in their place in an SGML/XML system.

<AXIOM NEGOTIABILITY="epsilon">
My basic tenet is that an XML document is either WF or it isn't.  If there is
one error, then the result is a null document.  If that isn't true then I think
we lose a large number of people who see XML as a robust and reliable way of
passing information.  
</AXIOM>

In this respect it's like a computer program.  If you get one error, you don't
get a *.exe (or if you do it ought not to run).  It interests me that sgmls
will output an ESIS stream if there are errors in the document (e.g. missing
IDs).

This is very important to anyone passing technical information.  Single bytes
can be critical, and I'm sure the same is true for many other subjects 
(legal, commercial, etc.).  We must remember that many XML documents will
never be read by humans so they mustn't rely on implied semantics for error
recovery.
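
The all-or-nothing tenet above can be sketched in a few lines of Python.
This is only an illustration: the helper name parse_or_null is hypothetical,
and Python's standard xml.etree.ElementTree parser stands in for whatever
XML processor is actually in use.

```python
import xml.etree.ElementTree as ET


def parse_or_null(text):
    """Return the parsed root element, or None if the text is not well-formed.

    A single well-formedness error yields the 'null document': the caller
    gets no partial tree to guess at.
    """
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        # One error is enough; nothing of the document is handed on.
        return None


good = parse_or_null("<molecule><atom/></molecule>")
bad = parse_or_null("<molecule><atom></molecule>")  # <atom> is never closed
```

Here good is an element tree and bad is None, so a downstream tool cannot
mistake a damaged document for a usable one.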

For me the basic questions are:
	- does the spec anticipate all error conditions?  I suspect that
		XML-LANG is probably fairly close to it though it needs 
		torturing, but XML-LINK hardly addresses errors at all.
		[Note: there are interactions between XML-LINK errors
		and XML-LANG parsers which need to be addressed]
	- in a multicomponent processor, which component has the job of
		catching which error? (parser, link processor, stylesheet mgr)
	- are there areas which are so complex that it will not be possible
		to analyse fully? I am sure the topology of some linksets
		could cause problems - I have already produced AUTO/REPLACE
		cycles (deliberately :-)
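
As a rough illustration of the linkset-topology worry in the last point, a
link processor could at least detect a simple cycle before following links
automatically.  The sketch below assumes, purely for illustration, that each
resource has at most one automatically followed replacing link, held in a
dictionary; find_cycle and the resource names are hypothetical and not part
of XML-LINK.

```python
def find_cycle(auto_links, start):
    """Follow automatic replacing links from start.

    auto_links maps a resource name to the resource that replaces it.
    Returns the list of resources forming a cycle, or None if the chain ends.
    """
    seen = []
    current = start
    while current in auto_links:
        if current in seen:
            # We have come back to a resource already visited: a cycle.
            return seen[seen.index(current):]
        seen.append(current)
        current = auto_links[current]
    return None


links = {"a.xml": "b.xml", "b.xml": "c.xml", "c.xml": "a.xml"}
cycle = find_cycle(links, "a.xml")
```

Real linksets can be far more tangled than a single chain, so this covers
only the simplest case.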

After this is settled, it's probably useful to give the implementers some
guidance as to what the minimum expected of them is.  This is not trivial.  For
example, if a link processor detects a violation (perhaps a malformed TEI Xptr)
how does it report it?  It will depend on what it has been sent by the parser.
'Error in TEI Xptr in line 23 at ...CHLID(1,FOO)...'
                                    ^^^^^
If the Xptr was originally included as an entity, the error message will point
to a normalised version, which  may make no sense to the human reader.  (I 
assume sgmls, etc. have been down this road).
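
A minimal sketch of producing a report shaped like the one above, assuming
the link processor knows the line number and the offset of the offending
token within the text it was handed.  The function name and its parameters
are hypothetical.  Note that the offset refers to the text as received from
the parser, which is exactly the difficulty when that text is a normalised
entity expansion.

```python
def format_pointer_error(line_no, text, start, length):
    """Build a two-line report: the message, then carets under the bad token.

    start and length locate the offending token inside text, as seen by
    the link processor (not necessarily as the author typed it).
    """
    prefix = f"Error in TEI Xptr in line {line_no} at ..."
    message = f"{prefix}{text}..."
    underline = " " * (len(prefix) + start) + "^" * length
    return message + "\n" + underline


report = format_pointer_error(23, "CHLID(1,FOO)", 0, 5)
print(report)
```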


In message <3.0.32.19970418223518.009dbec0@pop.intergate.bc.ca> Tim Bray writes:
> In recent discussions, some but not all at the recent WWW6 conference, it has 
> become apparent that we have an opportunity, if we act now, to avoid one of 
> the big problems that has caused HTML a lot of grief.  This is the area of 
> error-handling.  HTML doesn't have any.  As a result, the browser and tool 
> vendors are stuck on an endless treadmill of trying to enhance the system 
> while at the same time handling any and all collections of bytes that Netscape 
> 1.X did.  Get a couple of beers into anyone from the big N or the big M and 
> you'll see some real tears over this.  In my former life as a Web indexer,
> I cried some of those tears myself.  So let's not let it happen again.

Agreed.  One of the many things that has really impressed me is that a clear 
spec makes it far easier to write code.  This is critical for documents
as well.

> 
> The subject is violations of well-formedness.  Well-formedness should be easy 
> for a document to attain.  In XML, documents will carry a heavy load of 
> semantics and formatting, attached to elements and attributes, probably with 
> significant amounts of indirection.  Can any application hope to 
> accomplish meaningful work in this mode if the document does not even manage 
> to be well-formed!?!?

No. The most it can do is present a mixture of the original text and error 
annotations.  (It can do this in a very gentle and helpful manner if it wants,
but the result is still null.)

<EXAMPLE YEAR="1997">
There is a (legacy) program in chemistry which reads in molecules and computes
a picture.  This program writes the output in a well-known (fuzzy) legacy
FORTRAN format.  
<FOOTNOTE>
For those of you who don't know FORTRAN, information 
is delineated by which column of a punched card a character appears in.  For 
those of you who have never seen a punched card it's a storage medium of about
0.000001 Mbytes cm^-2.  
</FOOTNOTE>
The second program reads this in (also using the FORTRAN format).  Prog1 (for 
which people pay money) got the column wrong (only by 1 - does that matter so 
much?).  Prog2 (which was free and highly regarded) got the format right.  
This meant that
ATOM   Cl
got converted to
ATOM  C
This 'converted' a Chlorine atom into a Carbon atom.  Take it on trust that
when this is repeated for 10^6 compounds in a company database it's not
a trivial problem.
</EXAMPLE>
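
The off-by-one failure in the example can be reproduced in miniature.  The
column positions below are invented for illustration (the real legacy format
is not given here), and write_atom and read_atom are hypothetical stand-ins
for Prog1 and Prog2.

```python
# Hypothetical layout: the element symbol occupies columns 7-8 (1-indexed).
ELEMENT_FIELD = slice(6, 8)


def write_atom(symbol, offset=0):
    """Write an ATOM record; a non-zero offset models the writer's column bug."""
    return "ATOM".ljust(6 + offset) + symbol.ljust(2)


def read_atom(record):
    """Read the element symbol from its fixed columns, as a correct reader would."""
    return record[ELEMENT_FIELD].strip()


correct = write_atom("Cl")           # symbol lands in the expected columns
shifted = write_atom("Cl", offset=1)  # symbol lands one column too far right

print(read_atom(correct))
print(read_atom(shifted))
```

The reader raises no error in either case: the shifted record is silently
read as carbon, which is the whole danger of a format with no notion of
well-formedness.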

My worry is, in fact, the opposite.  Will XML implementors be sufficiently
disciplined communally that they give a byte-for-byte, attribute-for-attribute,
element-for-element isomorphic output?  The impression I get is that
many proprietary SGML vendors started with 'their own version' of SGML which 
remained within their products.  (I've never used these, so I may be wrong).
It's axiomatic that no two HTML vendors will produce identical output, input,
display or anything else. 

<AXIOM>
It's critical that XML tools are totally interoperable.
</AXIOM>

<COROLLARY>
If one tool passes an invalid document to a second tool and the second tool
doesn't know it's invalid, then some people's worlds start falling apart.
</COROLLARY>

For this we need tools we can refer to like sgmls.  We need 'gold-standard'
tools that we all agree 'get it right'.  So, for example, no one should release
a parser that doesn't give the same output as <the standard in the community>,
whatever that turns out to be.  Same for links, styles and the rest of it.


> 
> I suggest that we add language to section 5, "conformance", which says:
> 
>  "An XML processor which encounters a violation of the constraints
>   of well-formedness must not thereafter pass any information about
>   text or markup to the application.  It must pass to the application
>   a notification of the first such violation encountered.  It MAY 
>   thereafter, at user option, pass to the application information
>   about well-formedness violations encountered after the first."
> 
> [or in English: you gotta tell the app about the first syntax botch you hit; 
>  you're allowed to send the app more error messages, but you're not allowed 
>  to send anything but error messages after you've detected an error]

This seems fine.  
<FOOTNOTE>
The first error messages I encountered were displayed on an oscilloscope.  
You only ever got one.  If you were lucky it might tell you the binary code
of some register.  But you could infer that either your program or the machine
was invalid.
</FOOTNOTE>
It's tremendous if you get a list of meaningful errors; compiler writers are
very clever here.  But, when a beginner gets 1000 error messages from sgmls,
they really need a message that says 'did you forget to include the 
SGML declaration?' :-)  Not trivial.
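
The behaviour in the proposed conformance language quoted above (report the
first violation, pass no further text or markup) can be sketched with
Python's standard xml.parsers.expat module, which happens to stop at the
first well-formedness error.  The function feed_application and the
(events, error) return shape are assumptions made for this sketch, not
anything the spec defines.

```python
import xml.parsers.expat


def feed_application(text):
    """Parse text and return (element names the application saw, first error).

    The error is None for a well-formed document.  After the first
    violation no further markup is passed on.
    """
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # The 'application' here just records each start tag it is told about.
    parser.StartElementHandler = lambda name, attrs: events.append(name)
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        reason = xml.parsers.expat.ErrorString(err.code)
        return events, f"line {err.lineno}: {reason}"
    return events, None


events, error = feed_application("<a><b></a><c/>")
```

In this run the application hears about a and b, receives one error
notification, and never sees c.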

> 
> If we wanted to avoid phrasing this in terms of the actions of a processor 
> (which might be a good idea in general for the spec) we could redefine 
> "markup" and "character data" in such a way that they are considered not 
> to exist in a document which is not well-formed.

Since I'm arguing that a non-WF document is nearly equivalent
to the null document, this follows trivially.
<FOOTNOTE>
It may contain some information: we may know what version of XML it isn't a WF 
instance of.
</FOOTNOTE>

> 
> Some might argue that this violates the Internet creed: "Be conservative in 
> what you supply, and liberal in what you accept."  I can live with that: 
> the consequences of the second half of that creed have led to intolerable 
> results in the quality and usability of the data on the Net.  Furthermore, 
> if you want to serve up ill-formed dogshit, this will presumably remain
> possible, because: "text/html means never having to say you're sorry."

We have a very attractive series of tools now, each with its role:

HTML                        XML                     SGML
Easy, universal, relies     Simple, accurate,       Very powerful, robust
on the human brain for      tailored for the WWW    Unlimited in its scope
processing                  Aimed at machines

There is an important role for each.  If you want to carry a poorly defined 
message to a human, HTML is appropriate.  If you want to manage complex 
documents SGML is essential.

Readers of c.t.s may have seen the discussion of 'OMITTAG considered obsolete'.
I confess until I saw this discussion, I had taken the same view, but I'm 
convinced otherwise now :-).  SGML has many roles that XML cannot fill (until 
the machines take over).  What we have to do is show people that XML has 
vast roles that HTML can never fill.

	P.


> Cheers, Tim Bray
> tbray@textuality.com http://www.textuality.com/ +1-604-708-9592
> 
> 

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Saturday, 19 April 1997 04:51:46 UTC