XML standard comments

Here are some comments on various parts of the XML 1.0 standard.
I have gone over the spec lightly, and see various typos and seeming
problems.

If these problems vanish on closer inspection, then you'll still
be able to glean from my comments an indication of how an average
programmer will react when confronted with the standard.

Richard Goerwitz
Scholarly Technology Group

====================================================================

2.2

  Why is [#x10000-#x10FFFF] specified here?  It's not Unicode.  In
  fact, it's representable only via UCS-4 and UTF-16 (in the latter
  case, with pairs of 16-bit surrogate code units).  Ditto for
  #xAC00-#xD7A3 in BaseChar (B.).
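
  To make the encoding point concrete, here is a minimal Python sketch
  (the helper name utf16_units is my own, purely for illustration).  It
  shows that a character in [#x10000-#x10FFFF] needs two 16-bit code
  units in UTF-16, while a character below #x10000 needs one.

```python
import struct

def utf16_units(code_point):
    """Return the UTF-16 code units (as ints) for a single code point."""
    data = chr(code_point).encode("utf-16-be")
    # Each code unit is 16 bits, i.e. two bytes, big-endian here.
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

print([hex(u) for u in utf16_units(0x0041)])    # one code unit
print([hex(u) for u in utf16_units(0x10000)])   # a surrogate pair
```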

2.3

  If you're going to go Unicode, why are you defining spaces only in
  the ASCII range?  What about #x2000-#x200F?
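
  A rough way to see the gap, sketched in Python.  XML_S below is my
  transcription of the four characters the S production allows; the
  "Zs" test relies on whatever Unicode database ships with your Python,
  so which of these code points count as space separators may differ
  from the Unicode version the spec was written against.

```python
import unicodedata

# The four characters the S production accepts as white space.
XML_S = {0x20, 0x09, 0x0D, 0x0A}

def is_xml_space(code_point):
    return code_point in XML_S

def is_unicode_space_separator(code_point):
    # "Zs" is the Unicode general category for space separators.
    return unicodedata.category(chr(code_point)) == "Zs"

for cp in range(0x2000, 0x2010):
    print("#x%04X" % cp, is_xml_space(cp), is_unicode_space_separator(cp))
```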

  Why in heaven's name have Name and Nmtoken been defined and used in
  such a way that a lexical analyzer can't determine which is which?
  Name is a subset of Nmtoken, and the question of which is needed (or
  valid) at any given point is only syntactically determinable.  Now
  if we are using standard lexical analyzers, we have to order the
  Name recognizer first, then put the Nmtoken recognizer afterwards.
  We also must alter all the productions that use Nmtoken to accept
  Name tokens, too.

  This is an unnecessary complexity - seemingly not in keeping with
  the ideal of simplicity behind XML.
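
  Here is roughly what that ordering looks like in practice.  This is
  only a sketch: the character classes are an ASCII-only simplification
  of the real productions, and classify and acceptable_as_nmtoken are
  names I made up for the illustration.

```python
import re

# ASCII-only simplification of the NameChar, Name and Nmtoken productions.
NAME_CHAR = r"[A-Za-z0-9._:\-]"
NAME = re.compile(r"[A-Za-z_:]" + NAME_CHAR + r"*")
NMTOKEN = re.compile(NAME_CHAR + r"+")

def classify(text):
    """Classify a whole string the way an ordered lexer would."""
    if NAME.fullmatch(text):       # Name has to be tried first ...
        return "NAME"
    if NMTOKEN.fullmatch(text):    # ... since every Name is also an Nmtoken
        return "NMTOKEN"
    return None

def acceptable_as_nmtoken(kind):
    # Productions that call for Nmtoken must accept NAME tokens too.
    return kind in ("NAME", "NMTOKEN")

for sample in ("title", "1st", "-x"):
    print(sample, classify(sample))
```

  Note that the lexer alone never reports "title" as an Nmtoken, which
  is why the grammar has to be patched wherever Nmtoken appears.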

  Note that I have already sent off to Michael a suggestion that the
  strings

   PUBLIC
   SYSTEM
   EMPTY
   ANY
   CDATA
   ID
   IDREF
   ENTITY
   NDATA
   plus all the #words (#PCDATA)

  be treated as reserved words, and not allowed to match the Name
  and Nmtoken productions (#PCDATA matches the Nmtoken production,
  if my brief reading is correct).

  This will simplify tokenizing, and will make XML files themselves
  clearer (seeing as the function of these keywords will become
  static).
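
  With reserved words, the tokenizer can settle the question with a
  table lookup before it ever considers Name.  A minimal sketch,
  assuming only the strings listed above are reserved and again using an
  ASCII-only stand-in for Name (token_kind is a hypothetical helper):

```python
import re

RESERVED = {"PUBLIC", "SYSTEM", "EMPTY", "ANY", "CDATA",
            "ID", "IDREF", "ENTITY", "NDATA"}

NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")   # ASCII-only simplification
HASH_WORD = re.compile(r"#[A-Z]+")                  # e.g. #PCDATA

def token_kind(text):
    """Return KEYWORD for reserved strings, NAME for ordinary names."""
    if HASH_WORD.fullmatch(text) or text in RESERVED:
        return "KEYWORD"
    if NAME.fullmatch(text):
        return "NAME"
    return None

for sample in ("SYSTEM", "#PCDATA", "chapter"):
    print(sample, token_kind(sample))
```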

2.8

  What exactly are all those whitespace tokens doing in the syntax
  spec for DOCTYPE declarations (or, for that matter, in many other
  productions)?  I don't see the point here! ;-)  Whitespace outside
  special sequences like comments and strings is not significant in
  most soundly designed languages.  The tokenizer simply uses it to
  split up the input.

  If things should be together, with no whitespace, then define the
  tokens without whitespace.  Define sequences in which whitespace
  is significant (e.g., in quoted strings) in such a way that they
  are unambiguously recognizable.
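
  What I have in mind is the usual arrangement, sketched here in
  Python.  The token classes are a toy subset I picked for the example,
  not the spec's: whitespace separates tokens and is then thrown away,
  except inside quoted strings, where it is kept.

```python
import re

TOKEN = re.compile(r"""
    (?P<ws>\s+)                     # separates tokens, otherwise ignored
  | (?P<string>"[^"]*"|'[^']*')     # whitespace inside quotes is kept
  | (?P<open><!)
  | (?P<close>>)
  | (?P<word>[^\s"'<>]+)
""", re.VERBOSE)

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at offset %d" % pos)
        if m.lastgroup != "ws":     # drop whitespace between tokens
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize('<!DOCTYPE  book\n  SYSTEM "my book.dtd">'))
```

  The parser then sees the same token sequence however the declaration
  was laid out, and the grammar needs no whitespace symbols at all.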

  Also, for the extSubset production: It's a hanging rule.  It has no
  parent.  Not terribly helpful for implementors.  They have to know,
  telepathically, that in fact some of the external entities in the
  DOCTYPE declaration refer to text streams that must be read in and
  then parsed according to this production.

  If you intend for implementors to switch input streams here, then
  please explain this.  You are essentially creating an implicit
  #include mechanism in this case - and I think that this is, in
  principle, a bad idea (if you need a preprocessor, then explicitly
  define one).  And you'll need to explain to implementors that you
  are essentially using a different grammar once the input streams
  are switched (parsed entities function differently), which means
  having our lexical analyzers insert invisible markers into the
  token stream to tell our parsers to start behaving differently.
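
  To show what I mean by invisible markers, here is a toy sketch.
  Everything in it is an assumption made for illustration: the marker
  names, the whitespace-only tokenizing, and the dictionary standing in
  for streams that a real processor would have to fetch and read.

```python
def tokens_with_markers(text, external_entities):
    """Yield whitespace-separated tokens, expanding %name; references.

    external_entities maps a (hypothetical) entity name to its
    replacement text, standing in for a separately read input stream.
    """
    for word in text.split():
        if word.startswith("%") and word.endswith(";"):
            name = word[1:-1]
            yield ("ENTER_ENTITY", name)    # invisible marker for the parser
            # The replacement text is lexed as a nested input stream.
            yield from tokens_with_markers(external_entities[name],
                                           external_entities)
            yield ("LEAVE_ENTITY", name)
        else:
            yield ("WORD", word)

print(list(tokens_with_markers("a %e; b", {"e": "x y"})))
```

  The parser has to watch for the two marker tokens and change its
  rules between them, which is exactly the burden I am objecting to.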

  You could solve this whole problem by just treating parsed
  entities the same way in the external and internal DTD subsets.  The
  cost here in grammatical absurdity just isn't worth the benefit.
  If parsed entities are so hard to process (except between markup)
  in the internal DTD subset, then they have no business in the
  external one.  Remember:  XML is supposed to be easy to process
  overall (not just on one level).

3.3.1

  Rule 58 is malformed.  It appears to be missing a trailing ')' token.

4.3.3

  You use the phrase, "In the absence of information provided by an
  external transport protocol...."  This means that we can override
  information contained (or not contained) in the XML file itself via
  some external mechanism.  If the goal is to make XML easy to process,
  you need to remove this phrase.  Why?  Because we must now either make
  XML processors double as external URI resolvers, with MIME-type
  detection systems, or else define some external interface through
  which the XML parser communicates with the mechanisms responsible for
  external transport.

  It is much simpler and more in keeping with the spirit of the XML spec
  itself to make XML files completely self-defining.  If they say that
  they are UTF-8, then (external transport issues aside) it is simply
  an error if they show up as UCS-2.
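
  A self-defining check needs nothing beyond the bytes of the file.
  The sketch below is deliberately crude and makes several assumptions
  of mine: it only tells 16-bit encodings from ASCII-compatible ones,
  it looks only at the first 200 bytes, and the function names are
  invented for the example.

```python
import re

ENCODING_DECL = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']')

def detect_family(data):
    """Guess the encoding family from the first bytes of the file."""
    if data.startswith(b"\xfe\xff") or data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe") or data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    return "ascii-compatible"       # UTF-8 and other ASCII supersets

def declared_encoding(data, family):
    """Read the encoding name out of the XML declaration, if present."""
    codec = "latin-1" if family == "ascii-compatible" else family
    head = data[:200].decode(codec, errors="replace")
    m = ENCODING_DECL.search(head)
    return m.group(1) if m else None

def check_self_consistent(data):
    """Return the encoding to use, or raise if the file contradicts itself."""
    family = detect_family(data)
    name = declared_encoding(data, family)
    if name is None:
        return family
    declared_16bit = name.upper() in ("UTF-16", "UCS-2", "ISO-10646-UCS-2")
    if declared_16bit != family.startswith("utf-16"):
        raise ValueError("declared %s but bytes look like %s" % (name, family))
    return name

print(check_self_consistent(b'<?xml version="1.0" encoding="UTF-8"?><a/>'))
```

  No transport layer is consulted anywhere; a file that says UTF-8 but
  arrives as 16-bit text is simply rejected.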

  Yes, if somebody mistakenly converts the XML file to the wrong format,
  this is a problem (the XML decl may not match the encoding).  But that
  is a problem for the conversion software and transport mechanisms used.
  XML shouldn't get into the business of worrying about such things.

Appendix B

  - There is a typo in the Digit spec:  #x0BE7-#x0BEF should be #x0BE6-#x0BEF

Richard Goerwitz

Received on Wednesday, 15 April 1998 10:40:08 UTC