RE: Possible XML and C14N errata from John Boyer on 2003-02-27

From: John Boyer <JBoyer@PureEdge.com>
Date: Wed, 26 Feb 2003 16:14:47 -0800
To: "Martin Duerst" <duerst@w3.org>, "Joseph Reagle" <reagle@w3.org>, <w3c-ietf-xmldsig@w3.org>
Cc: <FYergeau@alis.com>, <xml-editor@w3.org>
Message-ID: <7874BFCCD289A645B5CE3935769F0B52452A94@tigger.pureedge.com>

Hi Martin,

I understand the performance reasons.

However, I remain skeptical about the hand-waving around XML being old and most implementations having already stubbed their proverbial toes on this issue.

For example, I'm currently playing with the Xerces code base from Oct. 2002, and they still seem to be having problems around this issue.

Specifically, if I place byte 0x82 from the ANSI code page (Windows-1252) into content, it gets translated to Unicode U+201A, which our C14N implementation then encodes with the proper 3-byte UTF-8 sequence.

But when Xerces reads the UTF-8, it reconstitutes U+201A, then throws an error saying that 0x1A is illegal character content.  The behavior for 0x85 is even better.  The Unicode is U+2026, and 0x26 is the byte code for the ampersand.  If you feed the UTF-8 for U+2026 to Xerces, it complains that you have an invalid entity reference.

Clearly, they translate to Unicode, but then attempt to enforce this character content rule on the byte array containing the Unicode, which is quite wrong.  Amazingly, if I use character references such as &#x2026;, then no errors occur.
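
The byte arithmetic above can be checked with a short Python sketch. It assumes the "ANSI code page" is Windows-1252 (Python's cp1252 codec), and the helper name trace_byte is hypothetical. It only reproduces the numbers; it does not exercise Xerces.

```python
def trace_byte(raw: bytes):
    """Follow one Windows-1252 byte through the steps described above.

    Returns (code point, UTF-8 encoding, low byte of the code point).
    The low byte is what a parser would see if it wrongly inspected
    the in-memory Unicode buffer one byte at a time.
    """
    ch = raw.decode("cp1252")      # ANSI byte -> Unicode character
    code_point = ord(ch)
    utf8 = ch.encode("utf-8")      # what a C14N implementation should emit
    low_byte = code_point & 0xFF   # the byte mistaken for character content
    return code_point, utf8, low_byte


for raw in (b"\x82", b"\x85"):
    cp, utf8, low = trace_byte(raw)
    print(f"0x{raw[0]:02X} -> U+{cp:04X} -> UTF-8 {utf8.hex(' ')} (low byte 0x{low:02X})")
```

The low bytes come out as 0x1A (a control character that is illegal in XML 1.0 content) and 0x26 (the ampersand), which matches the two error messages reported above.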

I have not yet checked whether the Feb. 2003 Xerces code base has fixed this problem, but my point is only that troubles with this CharData issue are more recent than many seem to believe (e.g. a prominent XML parser still has problems with this as of Oct. 2002).  In the next few days, I'll know whether the latest Xerces still has this problem...

John Boyer, Ph.D.
Senior Product Architect
PureEdge Solutions Inc.

-----Original Message-----
From: Martin Duerst [mailto:duerst@w3.org]
Sent: Wednesday, February 26, 2003 10:03 AM
To: John Boyer; Joseph Reagle; w3c-ietf-xmldsig@w3.org
Cc: FYergeau@alis.com; xml-editor@w3.org
Subject: RE: Possible XML and C14N errata

At 15:57 03/02/25 -0800, John Boyer wrote:
>Hi Joseph,
>
>Our implementation of C14N still runs into a bit of trouble because, 
>although you can't get the Xerces parser to read the bytes, there is 
>apparently nothing wrong with making a DOM call that sets the illegal 
>bytes into the content.  I already have to do other things to enforce the 
>Xpath data model on the DOM tree, so this is something I can take up in 
>implementation.

There are lots of other ways to produce illegal XML content using
the DOM. This was done for performance reasons.
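
As an illustration of this point, here is a minimal sketch using Python's xml.dom.minidom rather than Xerces (an assumption made only for brevity; the function name dom_roundtrip_fails is hypothetical). The DOM call accepts a control character without complaint, and the serialized result is then rejected on parsing.

```python
from xml.dom.minidom import Document, parseString
from xml.parsers.expat import ExpatError


def dom_roundtrip_fails(text: str) -> bool:
    """Build <a>text</a> through DOM calls, serialize it, and re-parse it.

    Returns True if the serialized document is rejected by the parser.
    """
    doc = Document()
    root = doc.createElement("a")
    root.appendChild(doc.createTextNode(text))  # no character checking here
    doc.appendChild(root)
    serialized = doc.toxml()
    try:
        parseString(serialized)
    except ExpatError:
        return True
    return False


print(dom_roundtrip_fails("\x1a"))   # control character set via the DOM
print(dom_roundtrip_fails("fine"))   # ordinary text
```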

>From: Joseph Reagle [mailto:reagle@w3.org]

>The expression "[^<&]* - ([^<&]* ']]>' [^<&]*)" is rather baroque on its
>face, and as John notes it's very weird that we have to read an augmented
>BNF grammar which itself references back to a production to understand the
>CharData production.

Yes, it's not easy to understand. But there are a lot of things
in the BNF used in the XML REC that don't turn up in your everyday
BNF, so I'd guess that every implementer at one point or another
will have to check it anyway.
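
One way to read the CharData production quoted above is as a set difference: any run of characters other than '<' and '&', minus those runs that contain ']]>'. A minimal Python sketch of that reading follows. The function name is_chardata is hypothetical, and the check covers only this production, not the separate Char rule on legal code points.

```python
import re

# [^<&]* : any run of characters that are neither '<' nor '&'
_NO_MARKUP = re.compile(r"[^<&]*")


def is_chardata(s: str) -> bool:
    """True if s matches  [^<&]* - ([^<&]* ']]>' [^<&]*)."""
    # Left operand: the whole string is free of '<' and '&'.
    # Subtracted operand: such strings that contain ']]>' somewhere.
    return _NO_MARKUP.fullmatch(s) is not None and "]]>" not in s


for sample in ("a > b", "a ]]> b", "a & b", ""):
    print(repr(sample), is_chardata(sample))
```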

Regards,   Martin.

Received on Wednesday, 26 February 2003 19:15:33 UTC