RE: Possible XML and C14N errata from John Boyer on 2003-02-21 (xml-editor@w3.org from January to March 2003)

From: John Boyer <JBoyer@PureEdge.com>
Date: Fri, 21 Feb 2003 11:29:01 -0800
To: <w3c-ietf-xmldsig@w3.org>
Cc: <FYergeau@alis.com>, <xml-editor@w3.org>
Message-ID: <7874BFCCD289A645B5CE3935769F0B52452A7B@tigger.pureedge.com>

Hi Francois,

While rule 14 of XML is technically correct, I respectfully request it be clarified. It is very clever to redefine standard regular expression syntax so that it has an implicit additional restriction that affects the interpretation of rule 14, but

1) the experienced developer already knows BNF and regular expressions, so he/she is less likely to read the section because it requires the assumption that he/she is not too familiar with BNF, a contradiction.

2) the clever adjustment is some 40 or 50 pages from the rule for CharData.

3) If one follows a reasonable path through the XML spec to find out what is allowable content, one views rule 14 for CharData and rule 15 for Comment on the same screen, the latter of which takes the trouble of using the Char production. The reasonable conclusion is that if the XML spec had meant to restrict content to instances of Char, then Char would have been used there as well.

In essence, the redefinition of a standard notation, the fact that it is redefined at a distant point *later* in the spec, and the fact that Char is inconsistently used in the BNF rules of the spec combine to form a recipe for misunderstandings. Please reconsider the idea of consistently using Char in several places in the spec, especially the CharData.

I am also concerned and would like your opinion regarding a possible erratum to the XPath data model, which would percolate into the canonicalization spec. The problem is that, although XML technically does not allow most of the characters in the range 0x00 to 0x1f, one can still place them into the input stream using character references. Yet the XPath data model says that its strings are composed of characters as they are defined in XML.

So 'technically' C14N is OK because you seemingly can't create an XPath data model for the offending class of XML documents (those containing character references such as &#x10;). But I don't like 'technically' correct because I'm sure few people realize that there seemingly a class of XML documents for which there is no canonicalization because there is no XPath data model. Would you agree?

If so, then errata in Xpath and C14N would seem to be needed. The question then becomes which of the following corrective measures should be taken?

1) XML documents containing these character references are not supported, or
2) Xpath supports characters having all octets in the range 0x00-0x1f or possible 0x01-0x1f, and C14N writes them as character references.

John Boyer, Ph.D.
Senior Product Architect
PureEdge Solutions Inc.

Received on Friday, 21 February 2003 14:29:32 UTC