Re: Request for clarification on Canonical XML

From: Graham Klyne <gk@ninebynine.org>
Date: Fri, 25 Jul 2003 10:07:35 +0100
To: Martin Duerst <duerst@w3.org>, w3c-ietf-xmldsig@w3.org
Cc: w3c-i18n-ig@w3.org, w3c-rdfcore-wg@w3.org, "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>

I think what would help here would be to define a (referenceable) canonical
character sequence form, and use that to define the canonical octet
sequence representation.
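
A minimal sketch of that layering, in Python. It assumes the standard
library's xml.etree.ElementTree.canonicalize (Python 3.8 or later), which
implements C14N 2.0 rather than the 2001 Recommendation discussed here; the
helper names canonical_chars and canonical_octets are hypothetical and only
illustrate the separation of the two forms.

```python
from xml.etree.ElementTree import canonicalize


def canonical_chars(xml_text: str) -> str:
    """Canonical character sequence: a Unicode string, with no encoding chosen."""
    # With no `out` argument, canonicalize() returns the result as a str.
    return canonicalize(xml_text)


def canonical_octets(xml_text: str, encoding: str = "utf-8") -> bytes:
    """Canonical octet sequence, defined in terms of the character sequence."""
    # UTF-8 is the default, matching what Canonical XML prescribes;
    # another encoding can be requested without redefining canonicalization.
    return canonical_chars(xml_text).encode(encoding)
```

Under this split, signature code would hash canonical_octets(doc), while
abstract comparison or an API could work with canonical_chars(doc) alone.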


At 16:04 24/07/03 -0400, Martin Duerst wrote:

>Hello Joseph, dear XML Signature specialists,
>I would like to request a clarification on Canonical XML
>At http://www.w3.org/TR/2001/REC-xml-c14n-20010315#Terminology,
>Canonical XML says:
> >>>>
>The canonical form of an XML document is physical representation of the 
>document produced by the method described in this specification. The 
>changes are summarized in the following list:
>* The document is encoded in UTF-8
> >>>>
>There are numerous applications (parser testing, digital signatures,
>encryption) where it is important to have an actual physical representation
>for simple octet comparison or for input to cryptographic algorithms
>that usually take octet streams as inputs.
>However, it has recently come to my attention that there are also some
>attempts to use Canonical XML (or Exclusive XML Canonicalisation, which
>inherits this aspect of its definition from Canonical XML) in other
>situations, such as purely abstract modeling and comparison of XML
>documents or XML fragments, or for API definitions.
>For abstract modeling, the encoding in UTF-8 is irrelevant and confusing.
>For API definitions, specifying the encoding is crucial, but it may
>be counterproductive to use UTF-8 in a context where UTF-16 is
>widely used (e.g. Java, Python).
>Although UTF-8 is not appropriate in these cases, it seems to be difficult
>for users of Canonical XML to abstract away from it where necessary.
>I therefore propose to add some clarification. As a first actual
>text proposal (may need some additional work), I propose to
>add a note at the end of Section 1.1, Terminology:
>Note: Canonical XML is defined here in terms of a physical (octet-based)
>representation. This is appropriate for many applications, ranging from
>digital signatures to parser testing. However, there are cases where a
>physical representation is not needed, and there are cases where another
>physical representation is appropriate. As an example, it may be
>appropriate to choose UTF-16 rather than UTF-8 as the encoding of an
>API in a programming language using UTF-16 to represent Unicode strings,
>such as Java or Python. In such cases, users of Canonical XML should
>abstract from the physical character encoding if they note this.
>I'm sure there is a better way to word this.
>Regards,    Martin.

Graham Klyne
PGP: 0FAA 69FF C083 000B A2E9  A131 01B9 1C7A DBCA CB5E
Received on Friday, 25 July 2003 05:12:48 UTC
