unreachable symbols in the XML grammar from Michael.Goulish@SoftwareAG-USA.com on 2000-05-23 (xml-editor@w3.org from April to June 2000)

From: <Michael.Goulish@SoftwareAG-USA.com>
Date: Tue, 23 May 2000 11:57:11 -0400
To: xml-editor@w3.org
Cc: Mike.Champion@SoftwareAG-USA.com
Message-ID: <B48FCF558294D311ADD90080C8FAF3F85064AE@sunshine.ptg.sagus.com>

Greetings to the XML-Editor!


I recently implemented a parser for the full
XML grammar in C.  I may be unusual in that I
had no experience with XML when I started this
project, but I do have over 15 years' experience
as a full-time programmer and, before that, an
MS in computer science.

I thought you might be interested to hear
which parts of the XML 1.0 spec confused me the
most.  (I reserve the right to find other parts
confusing in the future.)




1. Not all the productions belong to the grammar.
-------------------------------------------------

   In my world, grammars have a single start symbol.
   If you represent a grammar as a graph of symbols, you
   *always* see a connected graph.  That means you can start
   with the start symbol and, through some series of 
   steps, reach any other symbol in the grammar.
   Any symbol that's not reachable in this way can
   be (and should be) discarded.
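
   As a minimal sketch of that reachability check, assuming the
   grammar is stored as a map from each symbol to the symbols its
   production refers to: the function names and the toy grammar
   below are illustrative assumptions, not XML 1.0 productions and
   not code from my parser.

```python
from collections import deque


def reachable_symbols(grammar, start):
    """Return every symbol reachable from `start` by following productions."""
    seen = {start}
    queue = deque([start])
    while queue:
        symbol = queue.popleft()
        # Symbols with no entry in the map are treated as terminals.
        for referenced in grammar.get(symbol, ()):
            if referenced not in seen:
                seen.add(referenced)
                queue.append(referenced)
    return seen


def unreachable_symbols(grammar, start):
    """Return the defined symbols that can never be reached from `start`."""
    return sorted(set(grammar) - reachable_symbols(grammar, start))


# Hypothetical toy grammar, used only to illustrate the idea.
TOY_GRAMMAR = {
    "document": ["prolog", "element"],
    "prolog": ["comment"],
    "element": ["name", "content"],
    "content": ["element", "comment"],
    "name": [],
    "comment": [],
    "names": ["name"],    # defined, but nothing reachable refers to it
    "orphan": ["names"],  # refers to "names", but is itself unreachable
}

print(unreachable_symbols(TOY_GRAMMAR, "document"))
```

   Note that "names" stays unreachable even though "orphan" refers
   to it, because "orphan" itself cannot be reached from the start
   symbol.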

   Starting from production "[1] document" I believe 
   that the following symbols are unreachable in the 
   XML 1.0 grammar:

     [6]  Names
     [8]  Nmtokens
     [30] extSubset
     [33] LanguageID
     [78] extParsedEnt
     [79] extPE
  
   I believe that, once the errata are taken into account
   (and they should be rolled into the main document
   immediately), all of these productions are used, at
   least in Validity Constraints.  But then they're not
   part of the grammar in the same sense that the other
   productions are, which is what their membership in the
   numbering scheme would seem to imply.

   It's odd and confusing not to be able to understand
   the grammar, at least on a purely syntactic level, without
   reading the accompanying prose.

   I would like to see unreachable symbols clearly marked
   in some way -- perhaps given a different numbering scheme
   to show that they are not part of the "main" grammar in 
   the same way as other productions are.  Maybe like 
   VC-1, VC-2, etc.



2. There is no number 2.
------------------------

   ( I guess I'll limit this to my main point for now.  
   Maybe more later. )





Thanks very much for your attention, and I'd 
be very interested to hear your thoughts --


-------------------------------- Mick .

Received on Tuesday, 23 May 2000 11:56:59 UTC