Possible XML and C14N errata from John Boyer on 2003-02-14 (xml-editor@w3.org from January to March 2003)

From: John Boyer <JBoyer@PureEdge.com>
Date: Fri, 14 Feb 2003 13:15:28 -0800
To: <xml-editor@w3.org>
Cc: <w3c-ietf-xmldsig@w3.org>
Message-ID: <7874BFCCD289A645B5CE3935769F0B52452A3F@tigger.pureedge.com>

There seems to be some difficulty regarding inappropriate handling of characters in the range [0x00-0x1f] - {0x09, 0x0A, 0x0D}.

If these characters are entered directly into element content, the Xerces parser reports illegal characters.  This appears to be because the XML rule named 'Char' forbids these characters.

On the other hand, the XML rule for element 'content' refers to 'CharData', which only forbids the use of less-than (<) and ampersand (&) in character content.  The canonicalization rule for text node processing was based on the CharData rule, so it is possible to get a correct c14n program to write data that Xerces cannot read and that is possibly not well-formed XML.  

If the 'Char' restrictions are intended to be applied to 'CharData', then rule [14] for CharData needs to look more like rule [15] for Comment, which appropriately uses Char.  Otherwise, the inclusion in CharData of characters forbidden by Char appears to be intentionally permitted by the XML spec.  Here is one way to rewrite the rule:

[14]    CharData    ::=    (Char - [<&])* - ((Char - [<&])* ']]>' (Char - [<&])*) 

If there is an erratum that rewrites rule 14 in a manner similar to the rule above (such that CharData reflects the restrictions of Char), then the C14N Recommendation will need an erratum to the processing model for text nodes.  A note will then be needed for XML DSig and exclusive C14N as well.

John Boyer, Ph.D.
Senior Product Architect
PureEdge Solutions Inc.

Received on Friday, 14 February 2003 16:16:03 UTC