- From: Tim Bray <tbray@textuality.com>
- Date: Sun, 20 Apr 1997 10:44:36 -0700
- To: w3c-sgml-wg@w3.org
To summarize: I proposed that XML processors be required to stop passing data (other than error notifications) to applications after the first violation of well-formedness. Lots of people disagree. I will respond to some detail points below, but let me amplify a bit. Some have expressed doubt that the big M and N would really go for this behavior. M is on the list, N soon will be, they can speak for themselves, but I am representing what I understand to be their position. In the following list, perhaps #4 is the most important. 1. HTML only exists to be displayed. A botched display is, well, unfortunate. XML would not exist if it did not presuppose sophisticated processing by the recipient. It seems obvious that such processing cannot be trusted if the document fails to meet even basic standards of well-formedness. Java is fragile enough as it is without trying to build in botched-tag-recovery heuristics. 2. XML docs are expected to support XML Xpointers. Is it sane to try to interpret something like HREF="#HERE,DESCENDENT(1,FOO,BAR,23)" in a non-well-formed doc? 3. XML will be used to support EDI, big-time. The OFX format behind Money/Quicken is about to become XML, it seems. Obviously, a violation in this case will/should/must lead to instant abort of the transaction. 4. The vendors and *serious* information providers at one in wanting to create a non-HTML-like culture of publish-it-right on the Net; one way to do this is to shout, loudly, that there are a few (simple, thank goodness) rules, and they must be obeyed. 5. Clearly, authoring systems and the like must be able to deal, transiently, with non-well-formed documents. Is the input subsystem of an editor required to be an "XML processor"? Why should it? 6. Clearly, it is often the case that XML processors, having encountered errors, should in some cases proceed and report on all the following errors that can be detected. It does *not* therefore follow that they should merrily go on handing data and markup (of dubious reliability) to the application. 7. The vendors want off the "bugwards-compatible" treadmill. We should let them and in fact help them. 8. HTML still exists as an escape hatch for those who only care that their stuff looks decent on most screens most times. Terry Allen wrote: >Oh come on. That would make it be a violation of conformance to output >information about invalid portions of the instance aside from its >initial fault Wrong. It would be legal to go on issuing error messages. > including indexing of what text they contain (malformed >or not) and further malformations. Right. >Real world demands will require >that processors not be so dainty. This is the key point. My perception is that this is exactly what the vendors and information providers want. >Who's going to play cop? And if this were useful, wouldn't it be >useful only if all parsers encountered the same error first? No, because they're allowed to report all the errors they find. >[if you're] ... trying to avoid anyone making any effort at error >recovery, please give up now, immediately. That's exactly what I'm trying. Do you expect malformed PostScript to produce a valid page, or a birth-date of September 48th to be accepted by a payroll system? Another example. It is very difficult indeed to devise an input stream that will cause [gtn]roff to complain. It is very difficult to hand-author any significant amount of PostScript and not have the first few drafts thrown out with syntax errors. Both are successful. >| the consequences of the second half of that creed have led to intolerable >| results in the quality and usability of the data on the Net. >Name three such results. 1. David Siegel. 2. Overlapping FORMs and TABLEs. 3. FRAMESET/NOFRAMES 4. Javascript and CSS hidden in comments 5. Internet explorer uses 15 megabytes of memory James Clark writes: >>Can any application hope to >>accomplish meaningful work in this mode if the document does not even manage >>to be well-formed!?!? >It may very well be able to. For example, most valid SGML documents are >not well-formed XML documents, but an XML parser may well be able to handle >some of the constructs that are legal SGML but not legal XML. That's a good point; I could see an allowed special-case for *valid* SGML documents. >Consider >unquoted attribute values: an XML processor can easily recover from these. I disagree. The reliable presence of quotes makes XML attribute processing dramatically easier than SGML's. Furthermore, <foo attr="val>some more text & markup... 20k before the next quotation-mark will send most parsers (including at least SP and Lark) into gibber-mode. I think an unquoted attribute value is *precisely* the kind of thing that should remove any expectation that a document can be processed reliably by a receiving application, and which vendors should *not* be asked to heuristic (v?) their way around. > requiring that the processor not process anything >thereafter is totally unreasonable. As an implementor my reaction to that >is: No way! I would not be willing to implement such behaviour whatever the >spec says. The processor is not forbidden to proceed, looking for more errors (one of the things that SP is really good at) - it is just forbidden to pretend that everything is OK and go on passing bogus data to the application. You would be within your rights as an implementor to ignore this, but I, as an application builder, would not assign any significant client processing role to a component that insisted on trying to guess the meaning of egregiously broken data. >The XML spec isn't perfect: it has bugs and things that aren't clear. XML >parsers aren't perfect either. The result is there will be cases where one >implementation thinks a document is well-formed, but another doesn't. The >requirement you suggest would be disastrous in this situation. That is a good and troubling point. On the other hand, for the basics - proper tag/attribute syntax, nesting, and so on - it seems pretty likely that well-formedness should be *very* unambiguous. And on balance I am less troubled by the possibility of disagreement between processors, than by another another I-can-process-buggier- documents-than-you rat-race, or the possibility of downloading malformedness-recovery heuristics in every Java applet. Norbert Mikula writes: >What about an XML editor ? Of course, I might be in a situation where >I am, temporarily, in a state where I have no well-formdness. Right, I agree; although an XML editor that produces valid output probably cannot use a pure XML processor, in the sense the spec uses, to read its input anyhow. The whole editing/authoring system, seen globally from the outside, probably qualifies as an XML processor, in that it will not emit documents that are not well-formed. [or I won't use it] >Furthermore error messages only make sense if you know the context >of them. I want to show my user the output of the parser that >shows where exactly the error occured. You can provide the line-number, >of course, but you would have to go back to the original document, >load it non-parsed, so that you can exactly show the user where >the error occured. Exactly. This is exactly what Emacs+SP does, and it seems like the correct behavior to me. >So what I would like to focus this disucssion on is a taxonomy of >error events. If we can come up with a list of errors, the class >they belong to, and possible ways to handle these errors in terms >of actions of the application, I guess many people would be happy. That would make sense, and I could go for that. I am suggesting a simpler solution, in the spirit of XML. Sean McGrath: >I think it is hugely important to say "this is >not well formed" >but overly spartan to then say "so fix it and try again!". Yes, I agree; emit all the helpful error messages, the more the better - just don't go on, at the same time, passing data to the application as though nothing's wrong. Peter Murray-Rust: >My basic tenet is that an XML document is either WF or it isn't. If there >is one error, then the result is a null document. If that isn't true then I >think we lose a large number of people who see XML as a robust and >reliable way of passing information. I wouldn't go that far... I'd say "a null document plus error messages". To make things simple, I wouldn't even say "a null document", because that would forbid the processor from giving the app *anything* until he'd read the whole thing, to make sure it's WF. Peter Flynn: >[M & N] have no-one but themselves to blame for this. The facts of the >situation were explained to them ad nauseam in person and over the >wires long before the Web passed the point of inflection on its growth >curve. >Before we expend a large amount of effort on this, can someone confirm >that it stands some chance of being listened to. Yes; and they have learned by experience. I wouldn't be advancing this is I wasn't PRETTY DAMN SURE that this is exactly what these people want. I dunno, the forgiving behavior of N&M may in the big picture have been the right thing to do; perhaps an important part of the success of the Web. I.e. we may have a terrible hangover now, but we had fun last night so to speak. >Joe and Jill Homepage are not likely to give the proverbial tinker's cuss >whether their documents are well-formed or not, if the browsers are as >forgiving and tolerant as N or M. The browsers are going to have to offer >significant new features to compensate for the penalty of having the >parser gag on invalid syntax. Joe & Jill should go on using HTML. I myself will probably go on using HTML for all sorts of quick-post things. The significant new features (most of which involve smarts on the client to process richer data) require basically sound data; seems like a good trade-off to me. Sean McGrath again: >I was simply making the point that the likes of > nsgmls foo.sgm | grep -c "^(BAR$" >can be a useful thing to do even if foo.sgm markup contains errors. Sure; but if you replace "grep -c" with some fancy java applet that does a business-critical application, this is no longer useful but highly dangerous. -Tim
Received on Sunday, 20 April 1997 13:45:33 UTC