Re: Request for clarification on Canonical XML from Joseph Reagle on 2003-07-28 (w3c-rdfcore-wg@w3.org from July 2003)

From: Joseph Reagle <reagle@w3.org>
Date: Mon, 28 Jul 2003 11:35:26 -0400
To: Martin Duerst <duerst@w3.org>, w3c-ietf-xmldsig@w3.org
Cc: w3c-i18n-ig@w3.org, w3c-rdfcore-wg@w3.org, "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>
Message-Id: <200307281135.26082.reagle@w3.org>

On Thursday 24 July 2003 16:04, Martin Duerst wrote:
> The canonical form of an XML document is physical representation of the
> document produced by the method described in this specification. The
> changes are summarized in the following list:

Hi Martin, had this issue come up while we were writing the spec, I'm
confident we could have provided the clarity you are seeking, or maybe even
an additional definition of a "canonical character sequence form" as Graham
suggested. However, I think it would be inappropriate to add such a
definition now, and I'm not sure how to add even a "note" as an erratum.
It doesn't quite fit into "a Caveat where subsequent experience has shown 
that a recommendation of the specification was incorrect or needs further 
qualification." [1]

I don't object to the spirit of your text, and have tweaked it below:

[[[
Note: Canonical XML is an octet sequence resulting from characters, from the 
UCS character domain, encoded in UTF-8. This is necessary for the purposes 
of XML Signature and other applications. However, some applications may 
require a canonical form of XML that is a sequence of characters, without 
concern for its encoding and representation as octets. As an example, it 
may be appropriate to choose UTF-16 rather than UTF-8 as the encoding of an 
API in a programming language using UTF-16 to represent Unicode strings, 
such as Java or Python. In such cases, applications are not prohibited from 
defining and using a canonical character sequence that corresponds to the 
characters of a Canonical XML instance.
]]]
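
To make the octet/character distinction concrete, here is a minimal Python
sketch. It is not part of the proposed note. The sample document and the
helper name to_character_sequence are hypothetical, and the octets are only
assumed to be the output of some Canonical XML implementation; no real C14N
processing happens here.

```python
# Assumption: these octets stand in for the output of a C14N implementation.
# Canonical XML is always UTF-8, so U+00E9 occupies two octets here.
canonical_octets = '<doc attr="\u00e9">text</doc>'.encode("utf-8")


def to_character_sequence(octets: bytes) -> str:
    """Decode Canonical XML octets (UTF-8) into a sequence of characters."""
    return octets.decode("utf-8")


# The character sequence an application might treat as its canonical form.
chars = to_character_sequence(canonical_octets)

# The same characters in a different in-memory encoding, e.g. UTF-16.
utf16_octets = chars.encode("utf-16-le")

# The characters correspond one-to-one; only the octets differ.
assert chars.encode("utf-8") == canonical_octets
assert utf16_octets.decode("utf-16-le") == chars
assert utf16_octets != canonical_octets
```

An application that signs or digests the document would still need the UTF-8
octets, while one that only compares documents could presumably work on the
character sequence alone.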

I'm not sure if this is any better, and I'm not confident it should be an 
erratum, but perhaps you could use this thread in your discussions with 
others about whether the octet representation is really needed?

[1] http://www.w3.org/2001/03/C14N-errata

Received on Monday, 28 July 2003 11:35:29 UTC