- From: Richard L. Goerwitz III <richard@goon.stg.brown.edu>
- Date: Wed, 15 Apr 1998 10:39:32 -0400
- To: xml-editor@w3.org
Here are some comments on various parts of the XML 1.0 standard. I have gone over the spec lightly, and see various typos and seeming problems. If these problems vanish on closer inspection, then you'll still be able to glean from my comments an indication of how an average programmer will react when confronted with the standard.

Richard Goerwitz
Scholarly Technology Group

====================================================================

2.2

Why is [#x10000-#x10FFFF] specified here? It's not Unicode. In fact, it's representable only via UCS-4 and UTF-16 (in the latter case, with surrogate pairs of two 16-bit code units). Ditto for \xac00-\xd7a3 in BaseChar (Appendix B).

2.3

If you're going to go Unicode, why are you defining spaces only in the ASCII range? What about 2000-200F?

Why in heaven's name have Name and Nmtoken been defined and used in such a way that a lexical analyzer can't determine which is which? Name is a subset of Nmtoken, and the question of which is needed (or valid) at any given point can be determined only syntactically. Now if we are using standard lexical analyzers, we have to order the Name recognizer first, then put the Nmtoken recognizer afterwards. We also must alter all the productions that use Nmtoken to accept Name tokens, too. This is an unnecessary complexity, seemingly not in keeping with XML's ideal of simplicity.

Note that I have already sent off to Michael a suggestion that the strings

PUBLIC SYSTEM EMPTY ANY CDATA ID IDREF ENTITY NDATA

plus all the #words (#PCDATA) be treated as reserved words, and not allowed to match the Name and Nmtoken productions (#PCDATA matches the Nmtoken production, if my brief reading is correct). This will simplify tokenizing, and will make XML files themselves clearer (seeing as the function of these keywords will become static).

2.8

What exactly are all those whitespace tokens doing in the syntax spec for DOCTYPE declarations (or, for that matter, in many other productions)? I don't see the point here!
;-) Whitespace, apart from special sequences like comments and strings, is not significant in most soundly designed languages. The tokenizer simply uses it to split up the input. If things should be together, with no whitespace, then define the tokens without whitespace. Define sequences in which whitespace is significant (e.g., in quoted strings) in such a way that they are unambiguously recognizable.

Also, for the extSubset production: It's a hanging rule. It has no parent. Not terribly helpful for implementors. They have to know, telepathically, that in fact some of the external entities in the DOCTYPE declaration refer to text streams that must be read in and then parsed according to this production.

If you intend for implementors to switch input streams here, then please explain this. You are essentially creating an implicit #include mechanism in this case, and I think that this is, in principle, a bad idea (if you need a preprocessor, then explicitly define one). And you'll need to explain to implementors that you are essentially using a different grammar once the input streams are switched (parsed entities function differently), which means having our lexical analyzers insert invisible markers into the token stream to tell our parsers to start behaving differently.

You could solve this whole problem by just treating parsed entities the same way in the external and internal DTD subsets. The cost here in grammatical absurdity just isn't worth the benefit. If parsed entities are so hard to process (except between markup) in the internal DTD subset, then they have no business in the external one. Remember: XML is supposed to be easy to process overall (not just on one level).

3.3.1

Rule 58 is malformed. It appears to be missing a trailing ')' token.

4.3.3

You use the phrase, "In the absence of information provided by an external transport protocol...."
This means that we can override information contained (or not contained) in the XML file itself via some external mechanism. If the goal is to make XML easy to process, you need to remove this phrase. Why? Because we must now either make XML processors double as external URI resolvers with MIME-type detection systems, or else define some external interface through which the XML parser communicates with the mechanisms responsible for external transport.

It is much simpler and more in keeping with the spirit of the XML spec itself to make XML files completely self-defining. If they say that they are UTF-8, then (external transport issues aside) it is simply an error if they show up as UCS-2. Yes, if somebody mistakenly converts the XML file to the wrong format, this is a problem (the XML decl may not match the encoding). But that is a problem for the conversion software and transport mechanisms used. XML shouldn't get into the business of worrying about such things.

Appendix B - There is a typo in the Digit spec: #x0BE7-#x0BEF should be #x0BE6-#x0BEF

Richard Goerwitz
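
P.S. A minimal sketch of the Name/Nmtoken ordering problem raised under 2.3, for anyone who wants to see it concretely. It assumes ASCII-only approximations of the spec's character classes (the real Letter, Digit, CombiningChar and Extender sets are far larger), and the names classify and accepts_nmtoken are hypothetical, not taken from any XML tool.

```python
import re

# ASCII-only approximations of the XML 1.0 Name and Nmtoken productions.
# Assumption: the spec's full Unicode character classes are omitted here.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")
NMTOKEN_RE = re.compile(r"[A-Za-z0-9._:-]+")


def classify(token):
    """Return the token kind a context-free lexer would report.

    The Name recognizer is tried first, so "Nmtoken" is reported only for
    strings that are not also Names.
    """
    if NAME_RE.fullmatch(token):
        return "Name"
    if NMTOKEN_RE.fullmatch(token):
        return "Nmtoken"
    return None


def accepts_nmtoken(kind):
    """Grammar-side check: a slot that wants an Nmtoken must also take a Name."""
    return kind in ("Name", "Nmtoken")


if __name__ == "__main__":
    for tok in ("foo", "xml:lang", "1st", "-x"):
        print(tok, classify(tok), accepts_nmtoken(classify(tok)))
```

The point of the sketch is that the lexer alone can never hand the parser an Nmtoken that happens to look like a Name, so every production that uses Nmtoken has to be widened on the grammar side.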
Received on Wednesday, 15 April 1998 10:40:08 UTC