Re: Error handling in XML from Peter Murray-Rust on 1997-04-20 (w3c-sgml-wg@w3.org from April 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Sun, 20 Apr 1997 10:41:58 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <5840@ursus.demon.co.uk>
<?XML VERSION="1.0" ?>
<TREE TITLE="Olea Europaea"/>

In message <199704200820.JAA29656@mail.iol.ie> digitome@iol.ie (Digitome Ltd.) writes:
> [Sean]
> >>A partial document is *not* a useless thing. One of the cool things about 
> >>XML as a document format is that some of the content can be recovered 
> >> even in the face of error. 
> [Peter]
> >But the whole point of Tim's suggestion is that the user _wouldn't_ get
> >a recoverable portion after the first error.
> 
> Yes and I dont think this will fly. .The big M/N may see the virtues of UA's
> having

This is a very important discussion as it goes to the heart of what XML is 
for (*in 1997* and *in 1998+*).  My own attitude is to quote Michael Faraday,
when asked what was the point of one of his many discoveries:
"Madam, what is the use of a new-born baby?"
(maybe slightly misquoted).  It's like saying, what is the use of C/Java/etc.
Personally I think it would be very sad to deliberately limit the power of
XML without careful thought.

I don't move in the SGML community, and everything I know about it is from
c.t.s, this WG and WWW pages.  I first came into contact with it about 2-2.5 years
ago and got a very strong impression that the SGML community REALLY CARED
about the accuracy of information.  Typical were the messages from Erik 
Naggum and others pointing out slips in terminology, the importance of
EVERY CHARACTER in a document, etc.  There are long discussions on c.t.s. about
exactly how to transmit a certain character precisely, what to do about
whitespace, etc.  The impression is very clear that SGML is the most accurate
and robust method of storing, transmitting and converting information.  Because
of this thoroughness and care, many large organisations require that 
other parties communicate with them in SGML, even though this is expensive
and there is a long learning curve.

It is primarily for this reason that I have crusaded in the molecular community
to use SGML for their information.  

My enthusiasm for XML is based on the same principles.  I am NOT an advocate
of arbitrary extensions to HTML for carrying molecular data robustly and 
accurately (this is a tough battle to fight :-)

My impression of XML was that it accepted the *philosophy* of SGML above.
I quote from XML-LANG (Abstract and 1.1):

<Q>The goal is to enable <I Annotation='PMR'>generic SGML</I> to be served...
XML has been designed for ease of implementation and for  
interoperability with both SGML and HTML.</Q>

<Q>XML shall be compatible with SGML</Q>.

Now my understanding of SGML is that it works by strict rules.  These rules
are complex and allow omission of tags, quotes and a lot else, but they are
algorithmic and precise.  The parser does not have to guess or use heuristics.
If an author provides a declaration that says there should be quotes round
attributes and there aren't, it's an error.  If the declaration says they can
be omitted, then the parser-writer has to support this.

My understanding is that XML has been designed to simplify the task of
creating documents and to vastly ease the burden on parser writers.  Both
of these also make it easier to agree on a precise specification.  Whilst this
may be difficult, it's an order of magnitude easier than SGML at least.  It
would surprise me if by the time of the final draft 'most' of the grey areas
in XML-LANG hadn't been identified and solved or flagged as insoluble.

I'm worried about the suggestion that parsers should make helpful guesses about
the author's intentions.  [I'm quite happy for tools to exist which take 
!WF XML and do their best to convert it to WF.  This is undoubtedly an 
important aspect of document creation.]  Here's a potential example:

<A ID=c1ccccc1>Benzene</A>

The helpful parser sees this and flags an error.  It notices the attribute
name is ID: "Ah, we have an ID - we'll fold this to uppercase".  [Yes, I know
this is unwarranted in a WF document because the type of the attribute is not 
known, but this is a user-friendly parser and it's making it easy for everyone.]
So the result is:

<A ID="C1CCCCC1">Benzene</A>

Unfortunately, the author had no idea that ID was anything special and their
database simply called this field 'ID'.  But no harm done... [Unfortunately,
yes.  By the implied semantics of the SMILES notation in chemistry, this is
now Cyclohexane - very different from benzene.]
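
To make the example concrete, here is a minimal sketch in Python.  It assumes
the standard-library xml.etree.ElementTree parser as a stand-in for a strict
parser; helpful_repair is a hypothetical function invented for illustration,
and its case-folding rule is simply the guess described above, not the
behaviour of any real tool.

```python
import re
import xml.etree.ElementTree as ET

SOURCE = '<A ID=c1ccccc1>Benzene</A>'

def strict_parse(text):
    """Return the parsed element, or None if the text is not well-formed."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None

def helpful_repair(text):
    """Hypothetical 'user-friendly' repair: quote bare attribute values and
    fold the value of any attribute named ID to uppercase (an unwarranted
    guess, since the attribute type is unknown without a DTD)."""
    def fix(match):
        name, value = match.group(1), match.group(2)
        if name == "ID":
            value = value.upper()
        return f'{name}="{value}"'
    # Match name=value pairs whose value is not quoted.
    return re.sub(r'(\w+)=([^\s"\'>]+)', fix, text)

strict_result = strict_parse(SOURCE)            # rejected: unquoted value
repaired = helpful_repair(SOURCE)               # now well-formed, but altered
guessed_id = strict_parse(repaired).get("ID")   # no longer the author's string
```

The strict path reports the problem and changes nothing; the repaired path
yields a well-formed document whose ID is no longer the SMILES string the
author wrote.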

My concern is that if the message goes out that XML is as tolerant of errors
as HTML and makes best guesses, we can forget about precise passage of 
information.  Since it will be most people's first conscious contact with SGML
(other than HTML, where they don't realise this) if an XML system breaks
on them, they won't get a very good impression of SGML.  Maybe that doesn't
matter...

The positive thing is that sites are starting to realise the value of 
syntactically correct HTML and flagging these documents.  XML MUST be at least
as brave as this.  Perhaps we can develop a W3C-XML stamp that can only be
added to a document if it is at least WF.  
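
A gate of that kind might look something like the following sketch, again
assuming Python's xml.etree.ElementTree as the well-formedness checker.  The
stamp text and the function name are hypothetical placeholders, not anything
W3C has defined.

```python
import xml.etree.ElementTree as ET

STAMP = "<!-- checked: well-formed -->"  # hypothetical stamp text

def stamp_if_well_formed(document):
    """Append the stamp only when the document parses as well-formed XML.

    Returns (document, None) with the stamp added on success, or the
    untouched document plus an error message on failure.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        line, column = err.position
        return document, f"not well-formed at line {line}, column {column}"
    return document + "\n" + STAMP, None
```

A document that fails the check is returned unchanged, with no attempt at
repair, so the stamp never appears on anything the parser had to guess about.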

We are all very excited about large Net players being interested in XML.  That's
great!  They also have to work with DNS, IP, HTTP, etc.  AFAIK they accept
that they have to work with these standards *precisely* or their materials won't
end up where they want in the form they sent them.  If XML is to become a WWW  
standard, then I see it at that level.  If both views are to be accommodated, 
then it would seem essential to have levels of conformance.

	P.


-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Sunday, 20 April 1997 06:50:37 UTC