
Re: Possible XML and C14N errata

From: Joseph Reagle <reagle@w3.org>
Date: Tue, 25 Feb 2003 18:30:05 -0500
To: "John Boyer" <JBoyer@PureEdge.com>, <w3c-ietf-xmldsig@w3.org>
Cc: <FYergeau@alis.com>, <xml-editor@w3.org>, Martin Dürst <duerst@w3.org>
Message-Id: <200302251830.05657.reagle@w3.org>

On Friday 21 February 2003 14:29, John Boyer wrote:
> So 'technically' C14N is OK because you seemingly can't create an XPath
> data model for the offending class of XML documents (those containing
> character references such as &#x10;).  But I don't like 'technically'
> correct because I'm sure few people realize that there is seemingly a class
> of XML documents for which there is no canonicalization because there is
> no XPath data model.  Would you agree?
> 1) XML documents containing these character references are not supported,

After thinking about it further and speaking with Martin, I think this is 
the case. There is no canonicalization for an XML instance containing a 
character reference such as &#x10;, because there is no XPath node set for 
it: no XPath processor would ever parse such an instance, since it is not 
well-formed XML.
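
A minimal sketch of that point, assuming Python's standard 
xml.etree.ElementTree (built on expat) as a stand-in for any conforming 
XML 1.0 parser; the helper name is_well_formed is only illustrative:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser stops at the first well-formedness error, so no
        # data model (and hence no node set) is ever built.
        return False
    return True


# &#x10; refers to a character outside the XML 1.0 Char production.
print(is_well_formed("<doc>&#x10;</doc>"))
# &#x9; (tab) is a legal character, so this reference is accepted.
print(is_well_formed("<doc>&#x9;</doc>"))
```

With an XML 1.0 parser the first call should report a rejection and the 
second an acceptance, which is why there is nothing left to canonicalize in 
the first case.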

The expression "[^<&]* - ([^<&]* ']]>' [^<&]*)" is rather baroque on its 
face, and as John notes it's very weird that we have to read an augmented 
BNF grammar which itself references back to a production to understand the 
CharData production. So, granted, it is ugly and confusing, but is it 
causing sufficient problems that it merits an erratum? Given that Xerces 
was balking on those characters, it seems that Xerces, at least, got it 
right.
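
One way to read that expression, sketched in Python: "A - B" in the spec's 
notation means "matches A but not B", so CharData is any run of text with 
no '<' or '&' that also does not contain the sequence "]]>". The function 
name is_chardata is an assumption for illustration, and the sketch covers 
only this production, not the separate Char constraint:

```python
import re

# [^<&]* : any run of characters containing neither '<' nor '&'
_NO_MARKUP = re.compile(r"[^<&]*")


def is_chardata(text: str) -> bool:
    """Check text against CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)."""
    # Left side of the '-': the whole string must match [^<&]*.
    matches_base = _NO_MARKUP.fullmatch(text) is not None
    # Right side of the '-': strings of that form that contain ']]>'
    # are subtracted from the set.
    contains_cdata_end = "]]>" in text
    return matches_base and not contains_cdata_end


print(is_chardata("plain text"))
print(is_chardata("a ]]> b"))
print(is_chardata("a < b"))
```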

I suspect that XML 1.0 is so old now <smile/> that the XML authors feel most 
implementations have already stubbed their toe and it's best to look to the 
future... Martin pointed out to me that XML 1.1 is supposed to ameliorate 
these problems with:
  http://www.w3.org/TR/xml11/#sec4.1
  Change the Well-formedness constraint: Legal Character to read:
  Characters referred to using character references must either match the
  production for Char, or be one of the ISO control characters in the 
  ranges [#x1-#x1F] or [#x7F-#x9F]. 
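
To make the difference concrete, here is a rough sketch comparing the two 
rules for a code point named by a character reference. It assumes the Char 
production as defined in XML 1.0 and applies the proposed wording exactly 
as quoted above; both function names are hypothetical:

```python
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    | [#x10000-#x10FFFF]."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_legal_reference_proposed(cp: int) -> bool:
    """Quoted wording: match Char, or be a control character in
    [#x1-#x1F] or [#x7F-#x9F]."""
    return is_xml10_char(cp) or 0x1 <= cp <= 0x1F or 0x7F <= cp <= 0x9F


# &#x10; fails the XML 1.0 rule but passes the proposed one.
print(is_xml10_char(0x10), is_legal_reference_proposed(0x10))
# &#x0; is rejected under both rules.
print(is_xml10_char(0x0), is_legal_reference_proposed(0x0))
```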
Received on Tuesday, 25 February 2003 18:30:44 GMT
