XML 1.0 BNF error/issues from Chris King on 2003-02-04 (xml-editor@w3.org from January to March 2003)

From: Chris King <chris.king@senet.com.au>
Date: Mon, 3 Feb 2003 23:01:59 -0500 (EST)
To: xml-editor@w3.org
Message-ID: <20030204143241.A12276@kf>

Dear XML specification editors,

I have been looking through the XML 1.0 2nd edition recommendation (6-Oct-2000)
and its associated errata (up to E41 as of 2002-09-18).  I have come across an
error in the EBNF production [65] and wish also to vote on some style changes
to some other productions.



----------------------AAAA----------------------

The current production [65] states:

	Ignore ::= Char* - (Char* ('<![' | ']]>') Char*)

...which is in error because it is equivalent to:

	Ignore ::= ( Char* - (Char* '<![' Char*) )
	         | ( Char* - (Char* ']]>' Char*) )

This represents all sequences of characters excluding those sequences that
contain both the string '<![' AND the string ']]>'.  I'm sure that you
intended it to represent all sequences of characters excluding those containing
the string '<![' OR the string ']]>' OR both.  My suggested replacement is:

	Ignore ::= Char* - (Char* '<![' Char*) - (Char* ']]>' Char*)

I could probably come up with some formal set-algebra to prove this really is
an error if needs be.
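
Short of formal set algebra, the three forms above can be compared by brute
force.  This is only a rough sketch: it assumes a five-character stand-in
alphabet for Char, reads '-' as set difference and '|' as union, and all of the
function names are made up for illustration.

```python
from itertools import product

# Assumption: a tiny stand-in alphabet for Char, just large enough to
# spell both delimiters '<![' and ']]>'.
ALPHABET = "<![]>"


def strings(max_len):
    """Yield every string over ALPHABET of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(ALPHABET, repeat=n):
            yield "".join(chars)


def original(s):
    # Ignore ::= Char* - (Char* ('<![' | ']]>') Char*)
    return not ("<![" in s or "]]>" in s)


def union_form(s):
    # ( Char* - (Char* '<![' Char*) ) | ( Char* - (Char* ']]>' Char*) )
    return ("<![" not in s) or ("]]>" not in s)


def suggested(s):
    # Ignore ::= Char* - (Char* '<![' Char*) - (Char* ']]>' Char*)
    return ("<![" not in s) and ("]]>" not in s)


def compare(max_len=6):
    """Report which forms accept exactly the same strings."""
    universe = list(strings(max_len))
    return {
        "original == union_form": all(original(s) == union_form(s) for s in universe),
        "original == suggested": all(original(s) == suggested(s) for s in universe),
    }


if __name__ == "__main__":
    for claim, holds in compare().items():
        print(claim, holds)
```

Under these assumptions the check finds that the original production and the
suggested replacement accept exactly the same strings, while the union form
additionally accepts strings such as '<![' on its own.  So the equivalence
step above may deserve a second look before this item is acted on.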



----------------------BBBB----------------------

The current production [15] (well-supported by surrounding text) states:

	Comment	::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

But, if it were restated as:

	Comment	::= '<!--' ( (Char* - (Char* '--' Char*)) (Char - '-')
	                   )? '-->'

...then the expression patterning would be more consistent with other character
strings that exclude double-character sequences, and the need for a
non-hyphen character just before the '-->' terminator is more easily deduced.
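
As a sanity check that the restated production accepts the same comment bodies
as the current one, here is a small sketch.  It assumes a three-character
stand-in alphabet for Char and only looks at the text between '<!--' and
'-->'; the helper names are hypothetical.

```python
import re
from itertools import product

# Assumption: a tiny stand-in alphabet for Char (hyphen plus two others).
ALPHABET = "-ab"

# Body of the current production [15]: ((Char - '-') | ('-' (Char - '-')))*
ORIGINAL_BODY = re.compile(r"(?:[^-]|-[^-])*")


def original_body(body):
    return ORIGINAL_BODY.fullmatch(body) is not None


def restated_body(body):
    # Body of the restated form: ( (Char* - (Char* '--' Char*)) (Char - '-') )?
    if body == "":
        return True  # the optional group is absent
    prefix, last = body[:-1], body[-1]
    return "--" not in prefix and last != "-"


def bodies_agree(max_len=6):
    """True if both forms accept the same bodies up to max_len characters."""
    for n in range(max_len + 1):
        for chars in product(ALPHABET, repeat=n):
            body = "".join(chars)
            if original_body(body) != restated_body(body):
                return False
    return True


if __name__ == "__main__":
    print(bodies_agree())
```

Both forms reject a body that ends in a hyphen (for example "a-"), which is
the restriction the restated production is meant to make easier to see.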



----------------------CCCC----------------------

Several productions make use of the notation for a set of characters with
some forbidden members:

	[^abc]

...as described in section 6 - Notation.  It is in this section that there is
the only link between the `forbidden' notation's global set of characters (from
which nominated characters are excluded), and the production [2] Char.

I guess you could say this is an obvious association, but I feel that its use
is unnecessary and inconsistent with other exclusion-style notations.  Here's
a vote for some editorial changes.

First up, the introduction of another symbol makes the definition of
[14] CharData crystal clear:

[14]
	CharData ::= DataChar* - (DataChar* ']]>' DataChar*)

[14a]
	DataChar ::= Char - '<' - '&'

...and to complete the edits:

[9]
	EntityValue ::=
	      '"' ( (Char - '%' - '&' - '"') | PEReference | Reference )* '"'
	    | "'" ( (Char - '%' - '&' - "'") | PEReference | Reference )* "'"

[10]
	AttValue ::= '"' ( (DataChar - '"') | Reference )* '"'
	           | "'" ( (DataChar - "'") | Reference )* "'"

[11]
	SystemLiteral ::= '"' (Char - '"')* '"'
	                | "'" (Char - "'")* "'"

The documented notations [^a-z] and [^#xN-#xN] are unnecessary as they aren't
used (and neither should [^abc] be ;-).
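
To show how [14] and [14a] read once the underlying character set is named
explicitly, here is a rough sketch.  It assumes the [2] Char ranges from the
recommendation, and the function names are mine, not anything in the spec.

```python
def is_char(c):
    """[2] Char, using the ranges given in the XML 1.0 recommendation."""
    cp = ord(c)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_data_char(c):
    """[14a] DataChar ::= Char - '<' - '&'   (proposed, hypothetical)."""
    return is_char(c) and c not in "<&"


def is_char_data(s):
    """[14] CharData ::= DataChar* - (DataChar* ']]>' DataChar*)."""
    return all(is_data_char(c) for c in s) and "]]>" not in s


if __name__ == "__main__":
    for sample in ["a > b", "a < b", "x]]>y", "\x00"]:
        print(repr(sample), is_char_data(sample))
```

The point is only that the exclusion is spelled out against Char in the
definition itself, rather than through the prose in section 6.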



----------------------DDDD----------------------

There is an extra set of parentheses in [20] CData that isn't doing anything,
and two extra sets in [11] SystemLiteral that aren't doing much (adjacent
productions seem to manage without them).



----------------------EEEE----------------------

There aren't any constraints given on [10] AttValue, but several in each of the
two places that it is used ([41] Attribute and [60] DefaultDecl).

Is it appropriate to tie common constraints onto [10]?

[WFC: No < in Attribute Values] is clearly in common,

[VC: Attribute Value Type] and [VC: Attribute Default Legal] ALMOST say
the same thing.  (The former requires a declaration, already happening in the
latter.)

Why doesn't [WFC: No External Entity References] apply to [60]?


EVEN MORE GENERALLY...

Is it appropriate to make reference to [6] Names and [8] Nmtokens here?
(Original definitions -- before the `re-application' erratum E20.)

I really don't see the point of having [6] and [8] as the only unreferenced
non-root productions in the grammar (the other unreferenced productions, the
roots, are [1] document, [30] extSubset, and [78] extParsedEnt).  The rest of
the grammar is a lexical description of what is allowed; all other restrictive
semantics that are the responsibility of the XML processor are listed as
constraints.  As it is, [10] might be read as letting it all through anyway.



--------------------------------------------

Well that's it.  I'm sorry if I got a bit carried away.  I realise that lots
(if not all) of this is probably useless as far as needing any real action,
but like we're always told, lobbying the politicians does eventually have
some effect.  Besides, whining alone isn't healthy.

With regards,
  Chris King
  (Sun Certified Programmer for Java 2 Platform 1.4)
  Longwood, South Australia
Received on Tuesday, 4 February 2003 10:44:58 UTC