From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Sat, 19 Apr 1997 08:35:22 GMT
To: w3c-sgml-wg@w3.org
This is a very important subject and I think it's come at just the right time. I am not a compiler expert, but I understand that this is not a trivial problem to handle. If we have interactive tools (e.g. something is 'processing' (parsing) an XML document in an editor) then you need powerful error handling. Although the spec refers to a 'processor' and an 'application', I have the strong feeling that it's natural and valuable to have more discrete components than this. At present we have two parsers which tackle a very well defined job: taking a document, validating it for well-formedness (and possibly validity), and transmitting some form of output from a WF document. In my simple view they are roughly analogous to sgmls in their place in an SGML/XML system.

<AXIOM NEGOTIABILITY="epsilon">
My basic tenet is that an XML document is either WF or it isn't. If there is one error, then the result is a null document. If that isn't true, then I think we lose a large number of people who see XML as a robust and reliable way of passing information.
</AXIOM>

In this respect it's like a computer program. If you get one error, you don't get a *.exe (or if you do, it ought not to run). It interests me that sgmls will still output an ESIS stream even if there are errors in the document (e.g. missing IDs).

This is very important to anyone passing technical information. Single bytes can be critical, and I'm sure the same is true for many other subjects (legal, commercial, etc.). We must remember that many XML documents will never be read by humans, so they mustn't rely on implied semantics for error recovery.

For me the basic questions are:

 - does the spec anticipate all error conditions? I suspect that XML-LANG is probably fairly close to it, though it needs torturing, but XML-LINK hardly addresses errors at all. [Note: there are interactions between XML-LINK errors and XML-LANG parsers which need to be addressed.]
 - in a multicomponent processor, which component has the job of catching which error (parser, link processor, stylesheet manager)?
 - are there areas which are so complex that it will not be possible to analyse them fully? I am sure the topology of some linksets could cause problems - I have already produced AUTO/REPLACE cycles (deliberately :-)

After this is settled, it's probably useful to give the implementers some guidance as to the minimum expected of them. This is not trivial. For example, if a link processor detects a violation (perhaps a malformed TEI Xptr), how does it report it? That will depend on what it has been sent by the parser:

    'Error in TEI Xptr in line 23 at ...CHLID(1,FOO)...'
                                        ^^^^^

If the Xptr was originally included as an entity, the error message will point to a normalised version, which may make no sense to the human reader. (I assume sgmls, etc. have been down this road.)
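To make the axiom concrete, here is a minimal sketch of 'WF or null', written in present-day Python purely for illustration (the helper name parse_or_null is my own invention, not anything from the spec). The processor either hands over a parsed document, or reports the first well-formedness violation, with its position, and hands over nothing at all:

```python
import xml.etree.ElementTree as ET

def parse_or_null(text):
    # Either a tree or nothing: one violation means a null document,
    # plus a diagnostic locating the first error encountered.
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        line, column = err.position
        return None, "not well-formed (line %d, column %d): %s" % (line, column, err)

# A truncated end-tag: the result is null, not a 'best guess' tree.
tree, diagnostic = parse_or_null("<mol><atom>Cl</atom></mol")
assert tree is None
print(diagnostic)
```

Note the contract: there is no mode in which the caller gets both a tree and an error.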
In message <3.0.32.19970418223518.009dbec0@pop.intergate.bc.ca> Tim Bray writes:

> In recent discussions, some but not all at the recent WWW6 conference, it has
> become apparent that we have an opportunity, if we act now, to avoid one of
> the big problems that has caused HTML a lot of grief. This is the area of
> error-handling. HTML doesn't have any. As a result, the browser and tool
> vendors are stuck on an endless treadmill of trying to enhance the system
> while at the same time handling any and all collections of bytes that Netscape
> 1.X did. Get a couple of beers into anyone from the big N or the big M and
> you'll see some real tears over this. In my former life as a Web indexer,
> I cried some of those tears myself. So let's not let it happen again.

Agreed. One of the many things that has really impressed me is that a clear spec makes it far easier to write code. This is critical for documents as well.

> The subject is violations of well-formedness. Well-formedness should be easy
> for a document to attain. In XML, documents will carry a heavy load of
> semantics and formatting, attached to elements and attributes, probably with
> significant amounts of indirection. Can any application hope to
> accomplish meaningful work in this mode if the document does not even manage
> to be well-formed!?!?

No. The most it can do is present a mixture of the original text and error annotations. (It can do this in a very gentle and helpful manner if it wants, but the result is still null.)

<EXAMPLE YEAR="1997">
There is a (legacy) program in chemistry which reads in molecules and computes a picture. This program writes its output in a well-known (fuzzy) legacy FORTRAN format. <FOOTNOTE> For those of you who don't know FORTRAN, information is delineated by which column of a punched card a character appears in. For those of you who have never seen a punched card, it's a storage medium of about 0.000001 Mbytes cm^-2. </FOOTNOTE> A second program reads this in (also using the FORTRAN format). Prog1 (for which people pay money) got the column wrong (only by 1 - does that matter so much?). Prog2 (which was free and highly regarded) got the format right. This meant that

    ATOM Cl

got converted to

    ATOM C

This 'converted' a Chlorine atom into a Carbon atom. Take it on trust that when this is repeated for 10^6 compounds in a company database it's not a trivial problem.
</EXAMPLE>

My worry is, in fact, the opposite. Will XML implementors be sufficiently disciplined, communally, to give byte-for-byte, attribute-for-attribute, element-for-element isomorphic output? The impression I get is that many proprietary SGML vendors started with 'their own version' of SGML which remained within their products. (I've never used these, so I may be wrong.) It's axiomatic that no two HTML vendors will produce identical output, input, display or anything else.

<AXIOM>
It's critical that XML tools are totally interoperable.
</AXIOM>

<COROLLARY>
If one tool passes an invalid document to a second tool and the second tool doesn't know it's invalid, then some people's worlds start falling apart.
</COROLLARY>

For this we need tools we can refer to, like sgmls: 'gold-standard' tools that we all agree 'get it right'. So, for example, no one should release a parser that doesn't give the same output as <the standard in the community>, whatever that turns out to be. The same goes for links, styles and the rest of it.

> I suggest that we add language to section 5, "conformance", which says:
>
>  "An XML processor which encounters a violation of the constraints
>   of well-formedness must not thereafter pass any information about
>   text or markup to the application. It must pass to the application
>   a notification of the first such violation encountered. It MAY
>   thereafter, at user option, pass to the application information
>   about well-formedness violations encountered after the first."
>
> [or in English: you gotta tell the app about the first syntax botch you hit;
> you're allowed to send the app more error messages, but you're not allowed
> to send anything but error messages after you've detected an error]

This seems fine.
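In code, that rule might look like the following sketch, using Python's SAX framework as a stand-in for an XML processor (illustrative only; the class names ToApplication and DraconianErrorHandler are mine). Markup events flow to the application until the first violation; from then on it receives the error notification and nothing else:

```python
import io
import xml.sax

class ToApplication(xml.sax.ContentHandler):
    # Stand-in for 'the application': it simply records what the
    # processor passes to it.
    def __init__(self):
        self.received = []
    def startElement(self, name, attrs):
        self.received.append(("start", name))
    def endElement(self, name):
        self.received.append(("end", name))

class DraconianErrorHandler(xml.sax.ErrorHandler):
    def __init__(self, app):
        self.app = app
    def fatalError(self, exception):
        # First violation: notify the application, then stop the
        # processor so no further text or markup gets through.
        self.app.received.append(("error", str(exception)))
        raise exception

app = ToApplication()
try:
    xml.sax.parse(io.StringIO("<a><b>text</a></b>"), app, DraconianErrorHandler(app))
except xml.sax.SAXParseException:
    pass
print(app.received)
# [('start', 'a'), ('start', 'b'), ('error', '...mismatched tag...')]
```

The 'at user option' clause might correspond to an error handler that collects further violations instead of raising, though still without passing any markup on after the first error.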
<FOOTNOTE>
The first error messages I encountered were displayed on an oscilloscope. You only ever got one. If you were lucky it might tell you the binary code of some register. But you could infer that either your program or the machine was invalid.
</FOOTNOTE>

It's tremendous if you get a list of meaningful errors; compiler writers are very clever here. But when a beginner gets 1000 error messages from sgmls, what they really need is a message that says 'did you forget to include the SGML declaration?' :-) Not trivial.

> If we wanted to avoid phrasing this in terms of the actions of a processor
> (which might be a good idea in general for the spec) we could redefine
> "markup" and "character data" in such a way that they are considered not
> to exist in a document which is not well-formed.

Since I'm arguing that a non-WF document is nearly equivalent to the null document, this follows trivially.

<FOOTNOTE>
It may contain some information: we may know what version of XML it isn't a WF instance of.
</FOOTNOTE>

> Some might argue that this violates the Internet creed: "Be conservative in
> what you supply, and liberal in what you accept." I can live with that:
> the consequences of the second half of that creed have led to intolerable
> results in the quality and usability of the data on the Net. Furthermore,
> if you want to serve up ill-formed dogshit, this will presumably remain
> possible, because: "text/html means never having to say you're sorry."

We have a very attractive series of tools now, each with its role:

    HTML                      XML                       SGML
    ----                      ---                       ----
    Easy, universal,          Simple, accurate,         Very powerful, robust,
    relies on the human       tailored for the WWW,     unlimited in its scope
    brain for processing      aimed at machines

There is an important role for each. If you want to carry a poorly defined message to a human, HTML is appropriate. If you want to manage complex documents, SGML is essential. Readers of c.t.s may have seen the discussion of 'OMITTAG considered obsolete'. I confess that until I saw this discussion I had taken the same view, but I'm convinced otherwise now :-). SGML has many roles that XML cannot fill (until the machines take over). What we have to do is show people that XML has vast roles that HTML can never fill.

	P.

> Cheers, Tim Bray
> tbray@textuality.com http://www.textuality.com/ +1-604-708-9592

--
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/