RE: Possible XML and C14N errata from John Boyer on 2003-02-25 (xml-editor@w3.org from January to March 2003)

From: John Boyer <JBoyer@PureEdge.com>
Date: Tue, 25 Feb 2003 15:57:34 -0800
To: "Joseph Reagle" <reagle@w3.org>, <w3c-ietf-xmldsig@w3.org>
Cc: <FYergeau@alis.com>, <xml-editor@w3.org>, Martin Dürst <duerst@w3.org>
Message-ID: <7874BFCCD289A645B5CE3935769F0B52452A8F@tigger.pureedge.com>

Hi Joseph,

Our implementation of C14N still runs into a bit of trouble because, although you can't get the Xerces parser to read the bytes, there is apparently nothing wrong with making a DOM call that sets the illegal bytes into the content.  I already have to do other things to enforce the Xpath data model on the DOM tree, so this is something I can take up in implementation.  

I would also agree that an erratum is probably not needed given the change in XML 1.1 cited by Martin as well as his point that character references to the illegal characters also violate well-formedness.  This last point also clears up my concerns about Xpath and C14N.

Thanks for the clarifications!

John Boyer

-----Original Message-----
From: Joseph Reagle [mailto:reagle@w3.org]
Sent: Tuesday, February 25, 2003 3:30 PM
To: John Boyer; w3c-ietf-xmldsig@w3.org
Cc: FYergeau@alis.com; xml-editor@w3.org; Martin Dürst
Subject: Re: Possible XML and C14N errata

On Friday 21 February 2003 14:29, John Boyer wrote:
> So 'technically' C14N is OK because you seemingly can't create an XPath
> data model for the offending class of XML documents (those containing
> character references such as &#x10;).  But I don't like 'technically'
> correct because I'm sure few people realize that there seemingly a class
> of XML documents for which there is no canonicalization because there is
> no XPath data model.  Would you agree?
> 1) XML documents containing these character references are not supported,

After thinking about it further and speaking with Martin, I think this is 
the case. There is no canonicalization for an XML instance with a character 
such as &#x10, because there is no XPath node set for it, because no XPath 
processor would ever parse such an instance, because it's not well formed 
XML.

The expression "[^<&]* - ([^<&]* ']]>' [^<&]*)" is rather baroque on its 
face, and as John notes it's very weird that we have to read an augmented 
BNF grammar which itself references back to a production to understand the 
CharData production. So, granted, it is ugly and confusing, but is it 
causing sufficient problems that it merits an erratum? Given that Xerces 
was balking on those characters, it seems like it at least got it right.

I suspect that XML 1.0 is so old now <smile/> that the XML authors feel most 
implementations have already stubbed their toe and it's best to look to the 
future... Martin pointed out to me that XML 1.1 is supposed to ameliorate 
these problems with:
  http://www.w3.org/TR/xml11/#sec4.1
  Change the Well-formedness constraint: Legal Character to read:
  Characters referred to using character references must either match the
  production for Char, or be one of the ISO control characters in the 
  ranges [#x1-#x1F] or [#x7F-#x9F].

Received on Tuesday, 25 February 2003 18:58:39 UTC