- From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
- Date: Sun, 20 Apr 1997 10:41:58 GMT
- To: w3c-sgml-wg@w3.org
<?XML VERSION="1.0" ?>
<TREE TITLE="Olea Europaea"/>

In message <199704200820.JAA29656@mail.iol.ie> digitome@iol.ie (Digitome Ltd.) writes:
> [Sean]
> >>A partial document is *not* a useless thing. One of the cool things about
> >>XML as a document format is that some of the content can be recovered
> >> even in the face of error.
> [Peter]
> >But the whole point of Tim's suggestion is that the user _wouldn't_ get
> >a recoverable portion after the first error.
>
> Yes and I don't think this will fly. The big M/N may see the virtues of UA's
> having

This is a very important discussion, as it goes to the heart of what XML is for (*in 1997* and *in 1998+*). My own attitude is to quote Michael Faraday who, when asked what was the point of one of his many discoveries, replied: "Madam, what is the use of a new-born baby?" (maybe slightly misquoted). It's like asking what the use of C/Java/etc. is. Personally I think it would be very sad to deliberately limit the power of XML without careful thought.

I don't move in the SGML community, and everything I know about it is from c.t.s, this WG and WWW pages. I first came into contact with it about 2-2.5 years ago and got a very strong impression that the SGML community REALLY CARED about the accuracy of information. Typical were the messages from Erik Naggum and others pointing out slips in terminology, the importance of EVERY CHARACTER in a document, etc. There are long discussions on c.t.s about exactly how to transmit a certain character precisely, what to do about whitespace, etc. The impression is very clear that SGML is the most accurate and robust method of storing, transmitting and converting information.

Because of this thoroughness and care, many large organisations require that other parties communicate with them in SGML, even though this is expensive and there is a long learning curve. It is primarily for this reason that I have crusaded in the molecular community to use SGML for their information.

My enthusiasm for XML is based on the same principles. I am NOT an advocate of arbitrary extensions to HTML for carrying molecular data robustly and accurately (this is a tough battle to fight :-).

My impression of XML was that it accepted the *philosophy* of SGML above. I quote from XML-LANG (Abstract and 1.1): <Q>The goal is to enable <I Annotation='PMR'>generic SGML</I> to be served... XML has been designed for ease of implementation and for interoperability with both SGML and HTML.</Q> <Q>XML shall be compatible with SGML</Q>.

Now my understanding of SGML is that it works by strict rules. These rules are complex and allow omission of tags, quotes and a lot else, but they are algorithmic and precise. The parser does not have to guess or use heuristics. If an author provides a declaration that says there should be quotes round attributes and there aren't, it's an error. If the declaration says they can be omitted, then the parser-writer has to support this.

My understanding is that XML has been designed to simplify the task of creating documents and to vastly ease the burden on parser writers. Both of these also make it easier to agree on a precise specification. Whilst this may be difficult, it's at least an order of magnitude easier than for SGML. It would surprise me if, by the time of the final draft, 'most' of the grey areas in XML-LANG hadn't been identified and solved or flagged as insoluble.

I'm worried about the suggestion that parsers should make helpful guesses about the author's intentions. [I'm quite happy for tools to exist which take !WF XML and do their best to convert it to WF. This is undoubtedly an important aspect of document creation.] Here's a potential example:

    <A ID=c1ccccc1>Benzene</A>

The helpful parser sees this and flags an error.
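
For illustration only (none of this comes from the XML-LANG draft): a strict parser refuses an unquoted attribute outright instead of guessing, and it passes a quoted value through character for character. The sketch below assumes Python's standard xml.etree.ElementTree as a stand-in for such a parser; the helper name parse_strict is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_strict(text):
    """Return the parsed element, or None if the text is not well-formed.

    No repair is attempted: the first well-formedness error ends parsing.
    """
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


# Unquoted attribute value: not well-formed, so the strict parser rejects it.
unquoted = '<A ID=c1ccccc1>Benzene</A>'
# Quoted attribute value: well-formed, so the value is delivered unchanged.
quoted = '<A ID="c1ccccc1">Benzene</A>'

rejected = parse_strict(unquoted)
accepted = parse_strict(quoted)

if rejected is None:
    print("unquoted form rejected; nothing was guessed")
if accepted is not None:
    print("quoted form accepted; ID =", accepted.get("ID"))
```

The point of the design is that the author gets either an error or exactly the characters that were written, never a silently "improved" value.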
The parser then notices that the attribute name is ID: "Ah, we have an ID - we'll fold this to uppercase". [Yes, I know this is unwarranted in a WF document because the type of the attribute is not known, but this is a user-friendly parser and it's making it easy for everyone.] So the result is:

    <A ID="C1CCCCC1">Benzene</A>

Unfortunately, the author had no idea that ID was anything special and their database simply called this field 'ID'. But no harm done... [Unfortunately, there is. By the implied semantics of the SMILES notation in chemistry, this is now cyclohexane - very different from benzene.]

My concern is that if the message goes out that XML is as tolerant of errors as HTML and makes best guesses, we can forget about precise passage of information. Since it will be most people's first conscious contact with SGML (other than HTML, where they don't realise this), if an XML system breaks on them, they won't get a very good impression of SGML. Maybe that doesn't matter...

The positive thing is that sites are starting to realise the value of syntactically correct HTML and are flagging these documents. XML MUST be at least as brave as this. Perhaps we can develop a W3C-XML stamp that can only be added to a document if it is at least WF.

We are all very excited about large Net players being interested in XML. That's great! They also have to work with DNS, IP, HTTP, etc. AFAIK they accept that they have to work with these standards *precisely* or their materials won't end up where they want in the form they sent them. If XML is to become a WWW standard, then I see it at that level. If both views are to be accommodated, then it would seem essential to have levels of conformance.

P.

--
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Sunday, 20 April 1997 06:50:37 UTC